[PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents

Zhang Yi posted 13 patches 1 week, 3 days ago
There is a newer version of this series
fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
fs/ext4/inode.c          |  18 +++---
3 files changed, 176 insertions(+), 90 deletions(-)
[PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Zhang Yi 1 week, 3 days ago
Changes since v1:
 - Rebase the codes based on the latest linux-next 20251120.
 - Add patches 01-05, fix two stale data problems caused by
   EXT4_EXT_MAY_ZEROOUT when splitting extent.
 - Add patches 06-07, fix two stale extent status entries problems also
   caused by splitting extent.
 - Modify patches 08-10, extend __es_remove_extent() and
   ext4_es_cache_extent() to allow them to overwrite existing extents of
   the same status when caching on-disk extents, while also checking
   extents of different stauts and raising alarms to prevent misuse.
 - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
   remove the TODO comment in it.

v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/

Original Description

This series addresses the optimization that Jan pointed out [1]
regarding the introduction of a sequence number to
ext4_es_insert_extent(). The proposal is to replace all instances where
the cache of on-disk extents is updated by using ext4_es_cache_extent()
instead of ext4_es_insert_extent(). This change can prevent excessive
cache invalidations caused by unnecessarily increasing the extent
sequence number when reading from the on-disk extent tree.

[1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/

Cheers,
Yi.

Zhang Yi (13):
  ext4: cleanup zeroout in ext4_split_extent_at()
  ext4: subdivide EXT4_EXT_DATA_VALID1
  ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
  ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
    submitting I/O
  ext4: correct the mapping status if the extent has been zeroed
  ext4: don't cache extent during splitting extent
  ext4: drop extent cache before splitting extent
  ext4: cleanup useless out tag in __es_remove_extent()
  ext4: make __es_remove_extent() check extent status
  ext4: make ext4_es_cache_extent() support overwrite existing extents
  ext4: adjust the debug info in ext4_es_cache_extent()
  ext4: replace ext4_es_insert_extent() when caching on-disk extents
  ext4: drop the TODO comment in ext4_es_insert_extent()

 fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
 fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
 fs/ext4/inode.c          |  18 +++---
 3 files changed, 176 insertions(+), 90 deletions(-)

-- 
2.46.1
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Theodore Tso 3 days, 8 hours ago
Thanks to Jan and Ojaswin for their comments and reviews on this patch set.

As you may have noticed this version of the patchset is currently in
the ext4.git tree, and has shown up in linux-next.

Given that we have some suggested changes, there are a couple of ways
we can handle this:

1) Zhang Yi could produce a new version of the patch set, and I'll
replace the v2 version of the patch set currently in the ext4.git tree
with a newer version.

2) We could append fix-up patches to the ext4.git tree to reflect
those changes.

3) I could drop the patch set entirely and we apply a later version
of the patch series after the merge window.

What are folks' preferences?

Thanks,

						- Ted
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Zhang Yi 2 days, 23 hours ago
Hi, Ted!

On 11/29/2025 12:49 AM, Theodore Tso wrote:
> Thanks to Jan and Ojaswin for their comments and reviews on this patch set.
> 
> As you may have noticed this version of the patchset is currently in
> the ext4.git tree, and has shown up in linux-next.

The ext4.git tree[1] shows that only the first three patches from this
v2 version have been merged, possibly because the fourth patch conflicted
with ErKun's patch[2]. I've written a new version locally, adding two fix
patches for the three merged patches, and then rebase the subsequent
patches and modify them directly. I can send them out after texting.
Is this okay?

Thanks,
Yi.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
[2] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=dac092195b6a35bc7c9f11e2884cfecb1b25e20c

> 
> Given that we have some suggested changes, there are a couple of ways
> we can handle this:
> 
> 1) Zhang Yi could produce a new version of the patch set, and I'll
> replace the v2 version of the patch set currently in the ext4.git tree
> with a newer version.
> 
> 2) We could append fix-up patches to the ext4.git tree to reflect
> those changes.
> 
> 3) I could drop the patch set entirely and we apply a later version
> of the patch series after the merge window.
> 
> What are folks' preferences?
> 
> Thanks,
> 
> 						- Ted
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Theodore Tso 2 days, 21 hours ago
On Sat, Nov 29, 2025 at 09:32:25AM +0800, Zhang Yi wrote:
> 
> The ext4.git tree[1] shows that only the first three patches from this
> v2 version have been merged, possibly because the fourth patch conflicted
> with ErKun's patch[2].

Yeah, oops.  Sorry about that.  I think I had checked some other
branch via a git reset --hard, and forgot that I had accidentally
aborted the git am after the patch conflict.

I've rearranged the dev brach so that those first three patches are
not at the tip of the dev branch.  So if you want to grab the dev
branch, and then rebase your new series on top of commit 91ef18b567da,
you can do that.

* 1e366d088866 - (HEAD -> dev, ext4/dev) ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 (10 minutes ago)
* 6afae3414127 - ext4: subdivide EXT4_EXT_DATA_VALID1 (10 minutes ago)
* a4e2cc74a11b - ext4: cleanup zeroout in ext4_split_extent_at() (10 minutes ago)
* 91ef18b567da - ext4: mark inodes without acls in __ext4_iget() (10 minutes ago)

Note that the merge window opens on this coming Sunday, but a good
number of these patches are bug fixes (and not in the sense of adding
a new feature :-), so I'd prefer to land them this cycle if possible.

> I've written a new version locally, adding two fix
> patches for the three merged patches, and then rebase the subsequent
> patches and modify them directly. I can send them out after texting.
> Is this okay?

Why don't you modify your series so it applies on top of 91ef18b567da,
so we don't hace to have the fixup patches, and I may just simply push
everything up to 91ef18b567da to Linus, and we can then just apply the
next version of your series on top of that commit, and I'll push it to
Linus right after -rc1.

Does that seem workable to you?

							- Ted
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Zhang Yi 2 days, 20 hours ago
On 11/29/2025 11:52 AM, Theodore Tso wrote:
> On Sat, Nov 29, 2025 at 09:32:25AM +0800, Zhang Yi wrote:
>>
>> The ext4.git tree[1] shows that only the first three patches from this
>> v2 version have been merged, possibly because the fourth patch conflicted
>> with ErKun's patch[2].
> 
> Yeah, oops.  Sorry about that.  I think I had checked some other
> branch via a git reset --hard, and forgot that I had accidentally
> aborted the git am after the patch conflict.
> 
> I've rearranged the dev brach so that those first three patches are
> not at the tip of the dev branch.  So if you want to grab the dev
> branch, and then rebase your new series on top of commit 91ef18b567da,
> you can do that.
> 
> * 1e366d088866 - (HEAD -> dev, ext4/dev) ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 (10 minutes ago)
> * 6afae3414127 - ext4: subdivide EXT4_EXT_DATA_VALID1 (10 minutes ago)
> * a4e2cc74a11b - ext4: cleanup zeroout in ext4_split_extent_at() (10 minutes ago)
> * 91ef18b567da - ext4: mark inodes without acls in __ext4_iget() (10 minutes ago)
> 
> Note that the merge window opens on this coming Sunday, but a good
> number of these patches are bug fixes (and not in the sense of adding
> a new feature :-), so I'd prefer to land them this cycle if possible.
> 
>> I've written a new version locally, adding two fix
>> patches for the three merged patches, and then rebase the subsequent
>> patches and modify them directly. I can send them out after texting.
>> Is this okay?
> 
> Why don't you modify your series so it applies on top of 91ef18b567da,
> so we don't hace to have the fixup patches, and I may just simply push
> everything up to 91ef18b567da to Linus, and we can then just apply the
> next version of your series on top of that commit, and I'll push it to
> Linus right after -rc1.
> 
> Does that seem workable to you?
> 
> 							- Ted

Sure, I will rebase my series on top of 91ef18b567da and send out today.

Cheers,
Yi.
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Ojaswin Mujoo 1 week, 1 day ago
On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> Changes since v1:
>  - Rebase the codes based on the latest linux-next 20251120.
>  - Add patches 01-05, fix two stale data problems caused by

Hi Zhang, thanks for the patches.

I've always felt uncomfortable with the ZEROOUT code here because it
seems to have many such bugs as you pointed out in the series. Its very
fragile and the bugs are easy to miss behind all the data valid and
split flags mess. 

As per my understanding, ZEROOUT logic seems to be a special best-effort
try to make the split/convert operation "work" when dealing with
transient errors like ENOSPC etc. I was just wondering if it makes sense
to just get rid of the whole ZEROOUT logic completely and just reset the
extent to orig state if there is any error. This allows us to get rid of
DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
slightly less messy.

Maybe we can have a retry loop at the top level caller if we want to try
again for say ENOSPC or ENOMEM. 

Would love to hear your thoughts on it.

Thanks,
Ojaswin

>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
>  - Add patches 06-07, fix two stale extent status entries problems also
>    caused by splitting extent.
>  - Modify patches 08-10, extend __es_remove_extent() and
>    ext4_es_cache_extent() to allow them to overwrite existing extents of
>    the same status when caching on-disk extents, while also checking
>    extents of different stauts and raising alarms to prevent misuse.
>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>    remove the TODO comment in it.
> 
> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
> 
> Original Description
> 
> This series addresses the optimization that Jan pointed out [1]
> regarding the introduction of a sequence number to
> ext4_es_insert_extent(). The proposal is to replace all instances where
> the cache of on-disk extents is updated by using ext4_es_cache_extent()
> instead of ext4_es_insert_extent(). This change can prevent excessive
> cache invalidations caused by unnecessarily increasing the extent
> sequence number when reading from the on-disk extent tree.
> 
> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
> 
> Cheers,
> Yi.
> 
> Zhang Yi (13):
>   ext4: cleanup zeroout in ext4_split_extent_at()
>   ext4: subdivide EXT4_EXT_DATA_VALID1
>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>     submitting I/O
>   ext4: correct the mapping status if the extent has been zeroed
>   ext4: don't cache extent during splitting extent
>   ext4: drop extent cache before splitting extent
>   ext4: cleanup useless out tag in __es_remove_extent()
>   ext4: make __es_remove_extent() check extent status
>   ext4: make ext4_es_cache_extent() support overwrite existing extents
>   ext4: adjust the debug info in ext4_es_cache_extent()
>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
>   ext4: drop the TODO comment in ext4_es_insert_extent()
> 
>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>  fs/ext4/inode.c          |  18 +++---
>  3 files changed, 176 insertions(+), 90 deletions(-)
> 
> -- 
> 2.46.1
>
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Zhang Yi 1 week ago
Hi, Ojaswin!

On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>> Changes since v1:
>>  - Rebase the codes based on the latest linux-next 20251120.
>>  - Add patches 01-05, fix two stale data problems caused by
> 
> Hi Zhang, thanks for the patches.
> 

Thank you for take time to look at this series.

> I've always felt uncomfortable with the ZEROOUT code here because it
> seems to have many such bugs as you pointed out in the series. Its very
> fragile and the bugs are easy to miss behind all the data valid and
> split flags mess. 
> 

Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
significantly increased the complexity of split extents and the
potential for bugs.

> As per my understanding, ZEROOUT logic seems to be a special best-effort
> try to make the split/convert operation "work" when dealing with
> transient errors like ENOSPC etc. I was just wondering if it makes sense
> to just get rid of the whole ZEROOUT logic completely and just reset the
> extent to orig state if there is any error. This allows us to get rid of
> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
> slightly less messy.
> 
> Maybe we can have a retry loop at the top level caller if we want to try
> again for say ENOSPC or ENOMEM. 
> 
> Would love to hear your thoughts on it.
> 

I think this is a direction worth exploring. However, what I am
currently considering is that we need to address this scenario of
splitting extent during the I/O completion. Although the ZEROOUT logic
is fragile and has many issues recently, it currently serves as a
fallback solution for handling ENOSPC errors that arise when splitting
extents during I/O completion. It ensures that I/O operations do not
fail due to insufficient extent blocks.

Please see ext4_convert_unwritten_extents_endio(). Although we have made
our best effort to tried to split extents using
EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
covered all scenarios. Moreover, after converting the buffered I/O path
to the iomap infrastructure in the future, we may need to split extents
during the I/O completion worker[1].

In most block allocation processes, we already have a retry loop to deal
with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
doesn't seem appropriate to place this logic into the I/O completion
handling process (I haven't thought this solution through deeply yet,
but I'm afraid it could introduce potential deadlock risks due to its
involvement with journal operations), and we can't just simply try again.
If we remove the ZEROOUT logic, we may lose our last line of defense
during the I/O completion.

Currently, I am considering whether it is possible to completely remove
EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
submitting I/Os; instead, all splitting would be performed when
converting extents to written after the I/O completes. Based on my patch,
"ext4: use reserved metadata blocks when splitting extent on endio"[2],
and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
shows no regressions.

So I think the ZEROOUT logic remains somewhat useful until we find better
solution(e.g., making more precise reservations for metadata). Perhaps we
can refactor both the split extent and ZEROOUT logic to make them more
concise.

[1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
[2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/

Cheers,
Yi.

> Thanks,
> Ojaswin
> 
>>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
>>  - Add patches 06-07, fix two stale extent status entries problems also
>>    caused by splitting extent.
>>  - Modify patches 08-10, extend __es_remove_extent() and
>>    ext4_es_cache_extent() to allow them to overwrite existing extents of
>>    the same status when caching on-disk extents, while also checking
>>    extents of different stauts and raising alarms to prevent misuse.
>>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>>    remove the TODO comment in it.
>>
>> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
>>
>> Original Description
>>
>> This series addresses the optimization that Jan pointed out [1]
>> regarding the introduction of a sequence number to
>> ext4_es_insert_extent(). The proposal is to replace all instances where
>> the cache of on-disk extents is updated by using ext4_es_cache_extent()
>> instead of ext4_es_insert_extent(). This change can prevent excessive
>> cache invalidations caused by unnecessarily increasing the extent
>> sequence number when reading from the on-disk extent tree.
>>
>> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
>>
>> Cheers,
>> Yi.
>>
>> Zhang Yi (13):
>>   ext4: cleanup zeroout in ext4_split_extent_at()
>>   ext4: subdivide EXT4_EXT_DATA_VALID1
>>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>>     submitting I/O
>>   ext4: correct the mapping status if the extent has been zeroed
>>   ext4: don't cache extent during splitting extent
>>   ext4: drop extent cache before splitting extent
>>   ext4: cleanup useless out tag in __es_remove_extent()
>>   ext4: make __es_remove_extent() check extent status
>>   ext4: make ext4_es_cache_extent() support overwrite existing extents
>>   ext4: adjust the debug info in ext4_es_cache_extent()
>>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
>>   ext4: drop the TODO comment in ext4_es_insert_extent()
>>
>>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
>>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>>  fs/ext4/inode.c          |  18 +++---
>>  3 files changed, 176 insertions(+), 90 deletions(-)
>>
>> -- 
>> 2.46.1
>>
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Jan Kara 4 days, 12 hours ago
Hi Yi!

On Mon 24-11-25 13:04:04, Zhang Yi wrote:
> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> >> Changes since v1:
> >>  - Rebase the codes based on the latest linux-next 20251120.
> >>  - Add patches 01-05, fix two stale data problems caused by
> > 
> > Hi Zhang, thanks for the patches.
> > 
> 
> Thank you for take time to look at this series.
> 
> > I've always felt uncomfortable with the ZEROOUT code here because it
> > seems to have many such bugs as you pointed out in the series. Its very
> > fragile and the bugs are easy to miss behind all the data valid and
> > split flags mess. 
> > 
> 
> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
> significantly increased the complexity of split extents and the
> potential for bugs.

Yep, that code is complex and prone to bugs.

> > As per my understanding, ZEROOUT logic seems to be a special best-effort
> > try to make the split/convert operation "work" when dealing with
> > transient errors like ENOSPC etc. I was just wondering if it makes sense
> > to just get rid of the whole ZEROOUT logic completely and just reset the
> > extent to orig state if there is any error. This allows us to get rid of
> > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
> > slightly less messy.
> > 
> > Maybe we can have a retry loop at the top level caller if we want to try
> > again for say ENOSPC or ENOMEM. 
> > 
> > Would love to hear your thoughts on it.
> 
> I think this is a direction worth exploring. However, what I am
> currently considering is that we need to address this scenario of
> splitting extent during the I/O completion. Although the ZEROOUT logic
> is fragile and has many issues recently, it currently serves as a
> fallback solution for handling ENOSPC errors that arise when splitting
> extents during I/O completion. It ensures that I/O operations do not
> fail due to insufficient extent blocks.

Also partial extent zeroout offers a good performance win when the
portion needing zeroout is small (we can save extent splitting). And I
agree it is a good safety net for ENOSPC issues - otherwise there's no
guarantee page writeback can finish without hitting ENOSPC. We do have
reserved blocks for these cases but the pool is limited so you can still
run out of blocks if you try hard enough.

> Please see ext4_convert_unwritten_extents_endio(). Although we have made
> our best effort to tried to split extents using
> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
> covered all scenarios. Moreover, after converting the buffered I/O path
> to the iomap infrastructure in the future, we may need to split extents
> during the I/O completion worker[1].

Yes, this might be worth exploring. The advantage of doing extent splitting
in advance is that on IO submission you have the opportunity of restarting
the transaction on ENOSPC to possibly release some blocks. This is not
easily doable e.g. on writeback completion so the pressure on the pool of
reserved blocks is going to be more common (previously you needed reserved
blocks only when writeback was racing with fallocate or similar, now you
may need them each time you write in the middle of unwritten extent). So I
think the change will need some testing whether it isn't too easy to hit
ENOSPC conditions on IO completion without EXT4_GET_BLOCKS_IO_CREATE_EXT.
But otherwise it's always nice to remove code :)
 
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Zhang Yi 3 days, 20 hours ago
On 11/27/2025 8:24 PM, Jan Kara wrote:
> Hi Yi!
> 
> On Mon 24-11-25 13:04:04, Zhang Yi wrote:
>> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
>>> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>>>> Changes since v1:
>>>>  - Rebase the codes based on the latest linux-next 20251120.
>>>>  - Add patches 01-05, fix two stale data problems caused by
>>>
>>> Hi Zhang, thanks for the patches.
>>>
>>
>> Thank you for take time to look at this series.
>>
>>> I've always felt uncomfortable with the ZEROOUT code here because it
>>> seems to have many such bugs as you pointed out in the series. Its very
>>> fragile and the bugs are easy to miss behind all the data valid and
>>> split flags mess. 
>>>
>>
>> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
>> significantly increased the complexity of split extents and the
>> potential for bugs.
> 
> Yep, that code is complex and prone to bugs.
> 
>>> As per my understanding, ZEROOUT logic seems to be a special best-effort
>>> try to make the split/convert operation "work" when dealing with
>>> transient errors like ENOSPC etc. I was just wondering if it makes sense
>>> to just get rid of the whole ZEROOUT logic completely and just reset the
>>> extent to orig state if there is any error. This allows us to get rid of
>>> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
>>> slightly less messy.
>>>
>>> Maybe we can have a retry loop at the top level caller if we want to try
>>> again for say ENOSPC or ENOMEM. 
>>>
>>> Would love to hear your thoughts on it.
>>
>> I think this is a direction worth exploring. However, what I am
>> currently considering is that we need to address this scenario of
>> splitting extent during the I/O completion. Although the ZEROOUT logic
>> is fragile and has many issues recently, it currently serves as a
>> fallback solution for handling ENOSPC errors that arise when splitting
>> extents during I/O completion. It ensures that I/O operations do not
>> fail due to insufficient extent blocks.
> 
> Also partial extent zeroout offers a good performance win when the
> portion needing zeroout is small (we can save extent splitting). And I
> agree it is a good safety net for ENOSPC issues - otherwise there's no
> guarantee page writeback can finish without hitting ENOSPC. We do have
> reserved blocks for these cases but the pool is limited so you can still
> run out of blocks if you try hard enough.

Yes, Indeed!

> 
>> Please see ext4_convert_unwritten_extents_endio(). Although we have made
>> our best effort to tried to split extents using
>> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
>> covered all scenarios. Moreover, after converting the buffered I/O path
>> to the iomap infrastructure in the future, we may need to split extents
>> during the I/O completion worker[1].
> 
> Yes, this might be worth exploring. The advantage of doing extent splitting
> in advance is that on IO submission you have the opportunity of restarting
> the transaction on ENOSPC to possibly release some blocks. This is not
> easily doable e.g. on writeback completion so the pressure on the pool of
> reserved blocks is going to be more common (previously you needed reserved
> blocks only when writeback was racing with fallocate or similar, now you
> may need them each time you write in the middle of unwritten extent). So I
> think the change will need some testing whether it isn't too easy to hit
> ENOSPC conditions on IO completion without EXT4_GET_BLOCKS_IO_CREATE_EXT.
> But otherwise it's always nice to remove code :)
>  
> 								Honza

Yes, we need to be very careful about this, I have written a POC patch and
am currently conducting related tests. I hope it works.  :-)

Thanks,
Yi.
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Ojaswin Mujoo 1 week ago
On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
> 
> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
> > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
> >> Changes since v1:
> >>  - Rebase the codes based on the latest linux-next 20251120.
> >>  - Add patches 01-05, fix two stale data problems caused by
> > 
> > Hi Zhang, thanks for the patches.
> > 
> 
> Thank you for take time to look at this series.
> 
> > I've always felt uncomfortable with the ZEROOUT code here because it
> > seems to have many such bugs as you pointed out in the series. Its very
> > fragile and the bugs are easy to miss behind all the data valid and
> > split flags mess. 
> > 
> 
> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
> significantly increased the complexity of split extents and the
> potential for bugs.
> 
> > As per my understanding, ZEROOUT logic seems to be a special best-effort
> > try to make the split/convert operation "work" when dealing with
> > transient errors like ENOSPC etc. I was just wondering if it makes sense
> > to just get rid of the whole ZEROOUT logic completely and just reset the
> > extent to orig state if there is any error. This allows us to get rid of
> > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
> > slightly less messy.
> > 
> > Maybe we can have a retry loop at the top level caller if we want to try
> > again for say ENOSPC or ENOMEM. 
> > 
> > Would love to hear your thoughts on it.
> > 
> 
> I think this is a direction worth exploring. However, what I am
> currently considering is that we need to address this scenario of
> splitting extent during the I/O completion. Although the ZEROOUT logic
> is fragile and has many issues recently, it currently serves as a
> fallback solution for handling ENOSPC errors that arise when splitting
> extents during I/O completion. It ensures that I/O operations do not
> fail due to insufficient extent blocks.
> 
> Please see ext4_convert_unwritten_extents_endio(). Although we have made
> our best effort to tried to split extents using
> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
> covered all scenarios. Moreover, after converting the buffered I/O path
> to the iomap infrastructure in the future, we may need to split extents
> during the I/O completion worker[1].
> 
> In most block allocation processes, we already have a retry loop to deal
> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
> doesn't seem appropriate to place this logic into the I/O completion
> handling process (I haven't thought this solution through deeply yet,
> but I'm afraid it could introduce potential deadlock risks due to its
> involvement with journal operations), and we can't just simply try again.
> If we remove the ZEROOUT logic, we may lose our last line of defense
> during the I/O completion.
> 
> Currently, I am considering whether it is possible to completely remove
> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
> submitting I/Os; instead, all splitting would be performed when
> converting extents to written after the I/O completes. Based on my patch,
> "ext4: use reserved metadata blocks when splitting extent on endio"[2],
> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
> shows no regressions.
> 
> So I think the ZEROOUT logic remains somewhat useful until we find better
> solution(e.g., making more precise reservations for metadata). Perhaps we
> can refactor both the split extent and ZEROOUT logic to make them more
> concise.

Hi Yi,

Okay it makese sense to keep the zeroout if iomap path is planning to
shift the extent splitting to endio. Plus, I agree based on the comments
in the ext4_convert_unwritte_extents_endio() that we might even today
need to split in endio (although i cant recall when this happens) which
would need a zeroout fallback.

And yes, I think refactoring the whole logic to be less confusing would
be better. I had an older unposted untested patch cleaning up some of
this, I was looking at it again today and there seems to be a lot of
cleanups we can do here but that becomes out of scope of this patchset I
believe.

Sure then, lets keep it as it is for now. I'll review the changes you
made and later I can post a patch refactoring this area.

Regards,
ojaswin

> 
> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
> 
> Cheers,
> Yi.
> 
> > Thanks,
> > Ojaswin
> > 
> >>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
> >>  - Add patches 06-07, fix two stale extent status entries problems also
> >>    caused by splitting extent.
> >>  - Modify patches 08-10, extend __es_remove_extent() and
> >>    ext4_es_cache_extent() to allow them to overwrite existing extents of
> >>    the same status when caching on-disk extents, while also checking
> >>    extents of different stauts and raising alarms to prevent misuse.
> >>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
> >>    remove the TODO comment in it.
> >>
> >> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
> >>
> >> Original Description
> >>
> >> This series addresses the optimization that Jan pointed out [1]
> >> regarding the introduction of a sequence number to
> >> ext4_es_insert_extent(). The proposal is to replace all instances where
> >> the cache of on-disk extents is updated by using ext4_es_cache_extent()
> >> instead of ext4_es_insert_extent(). This change can prevent excessive
> >> cache invalidations caused by unnecessarily increasing the extent
> >> sequence number when reading from the on-disk extent tree.
> >>
> >> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
> >>
> >> Cheers,
> >> Yi.
> >>
> >> Zhang Yi (13):
> >>   ext4: cleanup zeroout in ext4_split_extent_at()
> >>   ext4: subdivide EXT4_EXT_DATA_VALID1
> >>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
> >>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
> >>     submitting I/O
> >>   ext4: correct the mapping status if the extent has been zeroed
> >>   ext4: don't cache extent during splitting extent
> >>   ext4: drop extent cache before splitting extent
> >>   ext4: cleanup useless out tag in __es_remove_extent()
> >>   ext4: make __es_remove_extent() check extent status
> >>   ext4: make ext4_es_cache_extent() support overwrite existing extents
> >>   ext4: adjust the debug info in ext4_es_cache_extent()
> >>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
> >>   ext4: drop the TODO comment in ext4_es_insert_extent()
> >>
> >>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
> >>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
> >>  fs/ext4/inode.c          |  18 +++---
> >>  3 files changed, 176 insertions(+), 90 deletions(-)
> >>
> >> -- 
> >> 2.46.1
> >>
>
Re: [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents
Posted by Zhang Yi 1 week ago
On 11/24/2025 8:50 PM, Ojaswin Mujoo wrote:
> On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote:
>> Hi, Ojaswin!
>>
>> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote:
>>> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote:
>>>> Changes since v1:
>>>>  - Rebase the codes based on the latest linux-next 20251120.
>>>>  - Add patches 01-05, fix two stale data problems caused by
>>>
>>> Hi Zhang, thanks for the patches.
>>>
>>
>> Thank you for take time to look at this series.
>>
>>> I've always felt uncomfortable with the ZEROOUT code here because it
>>> seems to have many such bugs as you pointed out in the series. Its very
>>> fragile and the bugs are easy to miss behind all the data valid and
>>> split flags mess. 
>>>
>>
>> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has
>> significantly increased the complexity of split extents and the
>> potential for bugs.
>>
>>> As per my understanding, ZEROOUT logic seems to be a special best-effort
>>> try to make the split/convert operation "work" when dealing with
>>> transient errors like ENOSPC etc. I was just wondering if it makes sense
>>> to just get rid of the whole ZEROOUT logic completely and just reset the
>>> extent to orig state if there is any error. This allows us to get rid of
>>> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() 
>>> slightly less messy.
>>>
>>> Maybe we can have a retry loop at the top level caller if we want to try
>>> again for say ENOSPC or ENOMEM. 
>>>
>>> Would love to hear your thoughts on it.
>>>
>>
>> I think this is a direction worth exploring. However, what I am
>> currently considering is that we need to address this scenario of
>> splitting extent during the I/O completion. Although the ZEROOUT logic
>> is fragile and has many issues recently, it currently serves as a
>> fallback solution for handling ENOSPC errors that arise when splitting
>> extents during I/O completion. It ensures that I/O operations do not
>> fail due to insufficient extent blocks.
>>
>> Please see ext4_convert_unwritten_extents_endio(). Although we have made
>> our best effort to tried to split extents using
>> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not
>> covered all scenarios. Moreover, after converting the buffered I/O path
>> to the iomap infrastructure in the future, we may need to split extents
>> during the I/O completion worker[1].
>>
>> In most block allocation processes, we already have a retry loop to deal
>> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it
>> doesn't seem appropriate to place this logic into the I/O completion
>> handling process (I haven't thought this solution through deeply yet,
>> but I'm afraid it could introduce potential deadlock risks due to its
>> involvement with journal operations), and we can't just simply try again.
>> If we remove the ZEROOUT logic, we may lose our last line of defense
>> during the I/O completion.
>>
>> Currently, I am considering whether it is possible to completely remove
>> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before
>> submitting I/Os; instead, all splitting would be performed when
>> converting extents to written after the I/O completes. Based on my patch,
>> "ext4: use reserved metadata blocks when splitting extent on endio"[2],
>> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld
>> shows no regressions.
>>
>> So I think the ZEROOUT logic remains somewhat useful until we find better
>> solution(e.g., making more precise reservations for metadata). Perhaps we
>> can refactor both the split extent and ZEROOUT logic to make them more
>> concise.
> 
> Hi Yi,
> 
> Okay it makese sense to keep the zeroout if iomap path is planning to
> shift the extent splitting to endio. Plus, I agree based on the comments
> in the ext4_convert_unwritte_extents_endio() that we might even today
> need to split in endio (although i cant recall when this happens) which
> would need a zeroout fallback.

A relatively common case is the concurrency between folio write-back and
fallocate. After an unwritten extent is allocated during the writeback
process, if fallocate is performed before the I/O completes, the unwritten
extent created by fallocate will merge with the writeback portion.
Consequently, a split operation is required once the I/O completes.

> 
> And yes, I think refactoring the whole logic to be less confusing would
> be better. I had an older unposted untested patch cleaning up some of
> this, I was looking at it again today and there seems to be a lot of
> cleanups we can do here but that becomes out of scope of this patchset I
> believe.
> 
> Sure then, lets keep it as it is for now. I'll review the changes you
> made and later I can post a patch refactoring this area.

OK, thank you a lot for your review and look forward to your patch.

Thanks,
Yi.

> 
> Regards,
> ojaswin
> 
>>
>> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/
>> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/
>>
>> Cheers,
>> Yi.
>>
>>> Thanks,
>>> Ojaswin
>>>
>>>>    EXT4_EXT_MAY_ZEROOUT when splitting extent.
>>>>  - Add patches 06-07, fix two stale extent status entries problems also
>>>>    caused by splitting extent.
>>>>  - Modify patches 08-10, extend __es_remove_extent() and
>>>>    ext4_es_cache_extent() to allow them to overwrite existing extents of
>>>>    the same status when caching on-disk extents, while also checking
>>>>    extents of different stauts and raising alarms to prevent misuse.
>>>>  - Add patch 13 to clear the usage of ext4_es_insert_extent(), and
>>>>    remove the TODO comment in it.
>>>>
>>>> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
>>>>
>>>> Original Description
>>>>
>>>> This series addresses the optimization that Jan pointed out [1]
>>>> regarding the introduction of a sequence number to
>>>> ext4_es_insert_extent(). The proposal is to replace all instances where
>>>> the cache of on-disk extents is updated by using ext4_es_cache_extent()
>>>> instead of ext4_es_insert_extent(). This change can prevent excessive
>>>> cache invalidations caused by unnecessarily increasing the extent
>>>> sequence number when reading from the on-disk extent tree.
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
>>>>
>>>> Cheers,
>>>> Yi.
>>>>
>>>> Zhang Yi (13):
>>>>   ext4: cleanup zeroout in ext4_split_extent_at()
>>>>   ext4: subdivide EXT4_EXT_DATA_VALID1
>>>>   ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
>>>>   ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
>>>>     submitting I/O
>>>>   ext4: correct the mapping status if the extent has been zeroed
>>>>   ext4: don't cache extent during splitting extent
>>>>   ext4: drop extent cache before splitting extent
>>>>   ext4: cleanup useless out tag in __es_remove_extent()
>>>>   ext4: make __es_remove_extent() check extent status
>>>>   ext4: make ext4_es_cache_extent() support overwrite existing extents
>>>>   ext4: adjust the debug info in ext4_es_cache_extent()
>>>>   ext4: replace ext4_es_insert_extent() when caching on-disk extents
>>>>   ext4: drop the TODO comment in ext4_es_insert_extent()
>>>>
>>>>  fs/ext4/extents.c        | 127 +++++++++++++++++++++++----------------
>>>>  fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
>>>>  fs/ext4/inode.c          |  18 +++---
>>>>  3 files changed, 176 insertions(+), 90 deletions(-)
>>>>
>>>> -- 
>>>> 2.46.1
>>>>
>>