fs/ext4/extents.c | 127 +++++++++++++++++++++++---------------- fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++--------- fs/ext4/inode.c | 18 +++--- 3 files changed, 176 insertions(+), 90 deletions(-)
Changes since v1:
- Rebase the codes based on the latest linux-next 20251120.
- Add patches 01-05, fix two stale data problems caused by
EXT4_EXT_MAY_ZEROOUT when splitting extent.
- Add patches 06-07, fix two stale extent status entries problems also
caused by splitting extent.
- Modify patches 08-10, extend __es_remove_extent() and
ext4_es_cache_extent() to allow them to overwrite existing extents of
the same status when caching on-disk extents, while also checking
extents of different stauts and raising alarms to prevent misuse.
- Add patch 13 to clear the usage of ext4_es_insert_extent(), and
remove the TODO comment in it.
v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/
Original Description
This series addresses the optimization that Jan pointed out [1]
regarding the introduction of a sequence number to
ext4_es_insert_extent(). The proposal is to replace all instances where
the cache of on-disk extents is updated by using ext4_es_cache_extent()
instead of ext4_es_insert_extent(). This change can prevent excessive
cache invalidations caused by unnecessarily increasing the extent
sequence number when reading from the on-disk extent tree.
[1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/
Cheers,
Yi.
Zhang Yi (13):
ext4: cleanup zeroout in ext4_split_extent_at()
ext4: subdivide EXT4_EXT_DATA_VALID1
ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before
submitting I/O
ext4: correct the mapping status if the extent has been zeroed
ext4: don't cache extent during splitting extent
ext4: drop extent cache before splitting extent
ext4: cleanup useless out tag in __es_remove_extent()
ext4: make __es_remove_extent() check extent status
ext4: make ext4_es_cache_extent() support overwrite existing extents
ext4: adjust the debug info in ext4_es_cache_extent()
ext4: replace ext4_es_insert_extent() when caching on-disk extents
ext4: drop the TODO comment in ext4_es_insert_extent()
fs/ext4/extents.c | 127 +++++++++++++++++++++++----------------
fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++---------
fs/ext4/inode.c | 18 +++---
3 files changed, 176 insertions(+), 90 deletions(-)
--
2.46.1
Thanks to Jan and Ojaswin for their comments and reviews on this patch set. As you may have noticed this version of the patchset is currently in the ext4.git tree, and has shown up in linux-next. Given that we have some suggested changes, there are a couple of ways we can handle this: 1) Zhang Yi could produce a new version of the patch set, and I'll replace the v2 version of the patch set currently in the ext4.git tree with a newer version. 2) We could append fix-up patches to the ext4.git tree to reflect those changes. 3) I could drop the patch set entirely and we apply a later version of the patch series after the merge window. What are folks' preferences? Thanks, - Ted
Hi, Ted! On 11/29/2025 12:49 AM, Theodore Tso wrote: > Thanks to Jan and Ojaswin for their comments and reviews on this patch set. > > As you may have noticed this version of the patchset is currently in > the ext4.git tree, and has shown up in linux-next. The ext4.git tree[1] shows that only the first three patches from this v2 version have been merged, possibly because the fourth patch conflicted with ErKun's patch[2]. I've written a new version locally, adding two fix patches for the three merged patches, and then rebase the subsequent patches and modify them directly. I can send them out after texting. Is this okay? Thanks, Yi. [1] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev [2] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=dac092195b6a35bc7c9f11e2884cfecb1b25e20c > > Given that we have some suggested changes, there are a couple of ways > we can handle this: > > 1) Zhang Yi could produce a new version of the patch set, and I'll > replace the v2 version of the patch set currently in the ext4.git tree > with a newer version. > > 2) We could append fix-up patches to the ext4.git tree to reflect > those changes. > > 3) I could drop the patch set entirely and we apply a later version > of the patch series after the merge window. > > What are folks' preferences? > > Thanks, > > - Ted
On Sat, Nov 29, 2025 at 09:32:25AM +0800, Zhang Yi wrote: > > The ext4.git tree[1] shows that only the first three patches from this > v2 version have been merged, possibly because the fourth patch conflicted > with ErKun's patch[2]. Yeah, oops. Sorry about that. I think I had checked some other branch via a git reset --hard, and forgot that I had accidentally aborted the git am after the patch conflict. I've rearranged the dev brach so that those first three patches are not at the tip of the dev branch. So if you want to grab the dev branch, and then rebase your new series on top of commit 91ef18b567da, you can do that. * 1e366d088866 - (HEAD -> dev, ext4/dev) ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 (10 minutes ago) * 6afae3414127 - ext4: subdivide EXT4_EXT_DATA_VALID1 (10 minutes ago) * a4e2cc74a11b - ext4: cleanup zeroout in ext4_split_extent_at() (10 minutes ago) * 91ef18b567da - ext4: mark inodes without acls in __ext4_iget() (10 minutes ago) Note that the merge window opens on this coming Sunday, but a good number of these patches are bug fixes (and not in the sense of adding a new feature :-), so I'd prefer to land them this cycle if possible. > I've written a new version locally, adding two fix > patches for the three merged patches, and then rebase the subsequent > patches and modify them directly. I can send them out after texting. > Is this okay? Why don't you modify your series so it applies on top of 91ef18b567da, so we don't hace to have the fixup patches, and I may just simply push everything up to 91ef18b567da to Linus, and we can then just apply the next version of your series on top of that commit, and I'll push it to Linus right after -rc1. Does that seem workable to you? - Ted
On 11/29/2025 11:52 AM, Theodore Tso wrote: > On Sat, Nov 29, 2025 at 09:32:25AM +0800, Zhang Yi wrote: >> >> The ext4.git tree[1] shows that only the first three patches from this >> v2 version have been merged, possibly because the fourth patch conflicted >> with ErKun's patch[2]. > > Yeah, oops. Sorry about that. I think I had checked some other > branch via a git reset --hard, and forgot that I had accidentally > aborted the git am after the patch conflict. > > I've rearranged the dev brach so that those first three patches are > not at the tip of the dev branch. So if you want to grab the dev > branch, and then rebase your new series on top of commit 91ef18b567da, > you can do that. > > * 1e366d088866 - (HEAD -> dev, ext4/dev) ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 (10 minutes ago) > * 6afae3414127 - ext4: subdivide EXT4_EXT_DATA_VALID1 (10 minutes ago) > * a4e2cc74a11b - ext4: cleanup zeroout in ext4_split_extent_at() (10 minutes ago) > * 91ef18b567da - ext4: mark inodes without acls in __ext4_iget() (10 minutes ago) > > Note that the merge window opens on this coming Sunday, but a good > number of these patches are bug fixes (and not in the sense of adding > a new feature :-), so I'd prefer to land them this cycle if possible. > >> I've written a new version locally, adding two fix >> patches for the three merged patches, and then rebase the subsequent >> patches and modify them directly. I can send them out after texting. >> Is this okay? > > Why don't you modify your series so it applies on top of 91ef18b567da, > so we don't hace to have the fixup patches, and I may just simply push > everything up to 91ef18b567da to Linus, and we can then just apply the > next version of your series on top of that commit, and I'll push it to > Linus right after -rc1. > > Does that seem workable to you? > > - Ted Sure, I will rebase my series on top of 91ef18b567da and send out today. Cheers, Yi.
On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote: > Changes since v1: > - Rebase the codes based on the latest linux-next 20251120. > - Add patches 01-05, fix two stale data problems caused by Hi Zhang, thanks for the patches. I've always felt uncomfortable with the ZEROOUT code here because it seems to have many such bugs as you pointed out in the series. Its very fragile and the bugs are easy to miss behind all the data valid and split flags mess. As per my understanding, ZEROOUT logic seems to be a special best-effort try to make the split/convert operation "work" when dealing with transient errors like ENOSPC etc. I was just wondering if it makes sense to just get rid of the whole ZEROOUT logic completely and just reset the extent to orig state if there is any error. This allows us to get rid of DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() slightly less messy. Maybe we can have a retry loop at the top level caller if we want to try again for say ENOSPC or ENOMEM. Would love to hear your thoughts on it. Thanks, Ojaswin > EXT4_EXT_MAY_ZEROOUT when splitting extent. > - Add patches 06-07, fix two stale extent status entries problems also > caused by splitting extent. > - Modify patches 08-10, extend __es_remove_extent() and > ext4_es_cache_extent() to allow them to overwrite existing extents of > the same status when caching on-disk extents, while also checking > extents of different stauts and raising alarms to prevent misuse. > - Add patch 13 to clear the usage of ext4_es_insert_extent(), and > remove the TODO comment in it. > > v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/ > > Original Description > > This series addresses the optimization that Jan pointed out [1] > regarding the introduction of a sequence number to > ext4_es_insert_extent(). The proposal is to replace all instances where > the cache of on-disk extents is updated by using ext4_es_cache_extent() > instead of ext4_es_insert_extent(). This change can prevent excessive > cache invalidations caused by unnecessarily increasing the extent > sequence number when reading from the on-disk extent tree. > > [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/ > > Cheers, > Yi. > > Zhang Yi (13): > ext4: cleanup zeroout in ext4_split_extent_at() > ext4: subdivide EXT4_EXT_DATA_VALID1 > ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 > ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before > submitting I/O > ext4: correct the mapping status if the extent has been zeroed > ext4: don't cache extent during splitting extent > ext4: drop extent cache before splitting extent > ext4: cleanup useless out tag in __es_remove_extent() > ext4: make __es_remove_extent() check extent status > ext4: make ext4_es_cache_extent() support overwrite existing extents > ext4: adjust the debug info in ext4_es_cache_extent() > ext4: replace ext4_es_insert_extent() when caching on-disk extents > ext4: drop the TODO comment in ext4_es_insert_extent() > > fs/ext4/extents.c | 127 +++++++++++++++++++++++---------------- > fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++--------- > fs/ext4/inode.c | 18 +++--- > 3 files changed, 176 insertions(+), 90 deletions(-) > > -- > 2.46.1 >
Hi, Ojaswin! On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote: > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote: >> Changes since v1: >> - Rebase the codes based on the latest linux-next 20251120. >> - Add patches 01-05, fix two stale data problems caused by > > Hi Zhang, thanks for the patches. > Thank you for take time to look at this series. > I've always felt uncomfortable with the ZEROOUT code here because it > seems to have many such bugs as you pointed out in the series. Its very > fragile and the bugs are easy to miss behind all the data valid and > split flags mess. > Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has significantly increased the complexity of split extents and the potential for bugs. > As per my understanding, ZEROOUT logic seems to be a special best-effort > try to make the split/convert operation "work" when dealing with > transient errors like ENOSPC etc. I was just wondering if it makes sense > to just get rid of the whole ZEROOUT logic completely and just reset the > extent to orig state if there is any error. This allows us to get rid of > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() > slightly less messy. > > Maybe we can have a retry loop at the top level caller if we want to try > again for say ENOSPC or ENOMEM. > > Would love to hear your thoughts on it. > I think this is a direction worth exploring. However, what I am currently considering is that we need to address this scenario of splitting extent during the I/O completion. Although the ZEROOUT logic is fragile and has many issues recently, it currently serves as a fallback solution for handling ENOSPC errors that arise when splitting extents during I/O completion. It ensures that I/O operations do not fail due to insufficient extent blocks. Please see ext4_convert_unwritten_extents_endio(). Although we have made our best effort to tried to split extents using EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not covered all scenarios. Moreover, after converting the buffered I/O path to the iomap infrastructure in the future, we may need to split extents during the I/O completion worker[1]. In most block allocation processes, we already have a retry loop to deal with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it doesn't seem appropriate to place this logic into the I/O completion handling process (I haven't thought this solution through deeply yet, but I'm afraid it could introduce potential deadlock risks due to its involvement with journal operations), and we can't just simply try again. If we remove the ZEROOUT logic, we may lose our last line of defense during the I/O completion. Currently, I am considering whether it is possible to completely remove EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before submitting I/Os; instead, all splitting would be performed when converting extents to written after the I/O completes. Based on my patch, "ext4: use reserved metadata blocks when splitting extent on endio"[2], and the ZEROOUT logic, this approach appears feasible, and xfstest-bld shows no regressions. So I think the ZEROOUT logic remains somewhat useful until we find better solution(e.g., making more precise reservations for metadata). Perhaps we can refactor both the split extent and ZEROOUT logic to make them more concise. [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/ [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/ Cheers, Yi. > Thanks, > Ojaswin > >> EXT4_EXT_MAY_ZEROOUT when splitting extent. >> - Add patches 06-07, fix two stale extent status entries problems also >> caused by splitting extent. >> - Modify patches 08-10, extend __es_remove_extent() and >> ext4_es_cache_extent() to allow them to overwrite existing extents of >> the same status when caching on-disk extents, while also checking >> extents of different stauts and raising alarms to prevent misuse. >> - Add patch 13 to clear the usage of ext4_es_insert_extent(), and >> remove the TODO comment in it. >> >> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/ >> >> Original Description >> >> This series addresses the optimization that Jan pointed out [1] >> regarding the introduction of a sequence number to >> ext4_es_insert_extent(). The proposal is to replace all instances where >> the cache of on-disk extents is updated by using ext4_es_cache_extent() >> instead of ext4_es_insert_extent(). This change can prevent excessive >> cache invalidations caused by unnecessarily increasing the extent >> sequence number when reading from the on-disk extent tree. >> >> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/ >> >> Cheers, >> Yi. >> >> Zhang Yi (13): >> ext4: cleanup zeroout in ext4_split_extent_at() >> ext4: subdivide EXT4_EXT_DATA_VALID1 >> ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 >> ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before >> submitting I/O >> ext4: correct the mapping status if the extent has been zeroed >> ext4: don't cache extent during splitting extent >> ext4: drop extent cache before splitting extent >> ext4: cleanup useless out tag in __es_remove_extent() >> ext4: make __es_remove_extent() check extent status >> ext4: make ext4_es_cache_extent() support overwrite existing extents >> ext4: adjust the debug info in ext4_es_cache_extent() >> ext4: replace ext4_es_insert_extent() when caching on-disk extents >> ext4: drop the TODO comment in ext4_es_insert_extent() >> >> fs/ext4/extents.c | 127 +++++++++++++++++++++++---------------- >> fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++--------- >> fs/ext4/inode.c | 18 +++--- >> 3 files changed, 176 insertions(+), 90 deletions(-) >> >> -- >> 2.46.1 >>
Hi Yi! On Mon 24-11-25 13:04:04, Zhang Yi wrote: > On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote: > > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote: > >> Changes since v1: > >> - Rebase the codes based on the latest linux-next 20251120. > >> - Add patches 01-05, fix two stale data problems caused by > > > > Hi Zhang, thanks for the patches. > > > > Thank you for take time to look at this series. > > > I've always felt uncomfortable with the ZEROOUT code here because it > > seems to have many such bugs as you pointed out in the series. Its very > > fragile and the bugs are easy to miss behind all the data valid and > > split flags mess. > > > > Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has > significantly increased the complexity of split extents and the > potential for bugs. Yep, that code is complex and prone to bugs. > > As per my understanding, ZEROOUT logic seems to be a special best-effort > > try to make the split/convert operation "work" when dealing with > > transient errors like ENOSPC etc. I was just wondering if it makes sense > > to just get rid of the whole ZEROOUT logic completely and just reset the > > extent to orig state if there is any error. This allows us to get rid of > > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() > > slightly less messy. > > > > Maybe we can have a retry loop at the top level caller if we want to try > > again for say ENOSPC or ENOMEM. > > > > Would love to hear your thoughts on it. > > I think this is a direction worth exploring. However, what I am > currently considering is that we need to address this scenario of > splitting extent during the I/O completion. Although the ZEROOUT logic > is fragile and has many issues recently, it currently serves as a > fallback solution for handling ENOSPC errors that arise when splitting > extents during I/O completion. It ensures that I/O operations do not > fail due to insufficient extent blocks. Also partial extent zeroout offers a good performance win when the portion needing zeroout is small (we can save extent splitting). And I agree it is a good safety net for ENOSPC issues - otherwise there's no guarantee page writeback can finish without hitting ENOSPC. We do have reserved blocks for these cases but the pool is limited so you can still run out of blocks if you try hard enough. > Please see ext4_convert_unwritten_extents_endio(). Although we have made > our best effort to tried to split extents using > EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not > covered all scenarios. Moreover, after converting the buffered I/O path > to the iomap infrastructure in the future, we may need to split extents > during the I/O completion worker[1]. Yes, this might be worth exploring. The advantage of doing extent splitting in advance is that on IO submission you have the opportunity of restarting the transaction on ENOSPC to possibly release some blocks. This is not easily doable e.g. on writeback completion so the pressure on the pool of reserved blocks is going to be more common (previously you needed reserved blocks only when writeback was racing with fallocate or similar, now you may need them each time you write in the middle of unwritten extent). So I think the change will need some testing whether it isn't too easy to hit ENOSPC conditions on IO completion without EXT4_GET_BLOCKS_IO_CREATE_EXT. But otherwise it's always nice to remove code :) Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On 11/27/2025 8:24 PM, Jan Kara wrote: > Hi Yi! > > On Mon 24-11-25 13:04:04, Zhang Yi wrote: >> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote: >>> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote: >>>> Changes since v1: >>>> - Rebase the codes based on the latest linux-next 20251120. >>>> - Add patches 01-05, fix two stale data problems caused by >>> >>> Hi Zhang, thanks for the patches. >>> >> >> Thank you for take time to look at this series. >> >>> I've always felt uncomfortable with the ZEROOUT code here because it >>> seems to have many such bugs as you pointed out in the series. Its very >>> fragile and the bugs are easy to miss behind all the data valid and >>> split flags mess. >>> >> >> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has >> significantly increased the complexity of split extents and the >> potential for bugs. > > Yep, that code is complex and prone to bugs. > >>> As per my understanding, ZEROOUT logic seems to be a special best-effort >>> try to make the split/convert operation "work" when dealing with >>> transient errors like ENOSPC etc. I was just wondering if it makes sense >>> to just get rid of the whole ZEROOUT logic completely and just reset the >>> extent to orig state if there is any error. This allows us to get rid of >>> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() >>> slightly less messy. >>> >>> Maybe we can have a retry loop at the top level caller if we want to try >>> again for say ENOSPC or ENOMEM. >>> >>> Would love to hear your thoughts on it. >> >> I think this is a direction worth exploring. However, what I am >> currently considering is that we need to address this scenario of >> splitting extent during the I/O completion. Although the ZEROOUT logic >> is fragile and has many issues recently, it currently serves as a >> fallback solution for handling ENOSPC errors that arise when splitting >> extents during I/O completion. It ensures that I/O operations do not >> fail due to insufficient extent blocks. > > Also partial extent zeroout offers a good performance win when the > portion needing zeroout is small (we can save extent splitting). And I > agree it is a good safety net for ENOSPC issues - otherwise there's no > guarantee page writeback can finish without hitting ENOSPC. We do have > reserved blocks for these cases but the pool is limited so you can still > run out of blocks if you try hard enough. Yes, Indeed! > >> Please see ext4_convert_unwritten_extents_endio(). Although we have made >> our best effort to tried to split extents using >> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not >> covered all scenarios. Moreover, after converting the buffered I/O path >> to the iomap infrastructure in the future, we may need to split extents >> during the I/O completion worker[1]. > > Yes, this might be worth exploring. The advantage of doing extent splitting > in advance is that on IO submission you have the opportunity of restarting > the transaction on ENOSPC to possibly release some blocks. This is not > easily doable e.g. on writeback completion so the pressure on the pool of > reserved blocks is going to be more common (previously you needed reserved > blocks only when writeback was racing with fallocate or similar, now you > may need them each time you write in the middle of unwritten extent). So I > think the change will need some testing whether it isn't too easy to hit > ENOSPC conditions on IO completion without EXT4_GET_BLOCKS_IO_CREATE_EXT. > But otherwise it's always nice to remove code :) > > Honza Yes, we need to be very careful about this, I have written a POC patch and am currently conducting related tests. I hope it works. :-) Thanks, Yi.
On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote: > Hi, Ojaswin! > > On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote: > > On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote: > >> Changes since v1: > >> - Rebase the codes based on the latest linux-next 20251120. > >> - Add patches 01-05, fix two stale data problems caused by > > > > Hi Zhang, thanks for the patches. > > > > Thank you for take time to look at this series. > > > I've always felt uncomfortable with the ZEROOUT code here because it > > seems to have many such bugs as you pointed out in the series. Its very > > fragile and the bugs are easy to miss behind all the data valid and > > split flags mess. > > > > Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has > significantly increased the complexity of split extents and the > potential for bugs. > > > As per my understanding, ZEROOUT logic seems to be a special best-effort > > try to make the split/convert operation "work" when dealing with > > transient errors like ENOSPC etc. I was just wondering if it makes sense > > to just get rid of the whole ZEROOUT logic completely and just reset the > > extent to orig state if there is any error. This allows us to get rid of > > DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() > > slightly less messy. > > > > Maybe we can have a retry loop at the top level caller if we want to try > > again for say ENOSPC or ENOMEM. > > > > Would love to hear your thoughts on it. > > > > I think this is a direction worth exploring. However, what I am > currently considering is that we need to address this scenario of > splitting extent during the I/O completion. Although the ZEROOUT logic > is fragile and has many issues recently, it currently serves as a > fallback solution for handling ENOSPC errors that arise when splitting > extents during I/O completion. It ensures that I/O operations do not > fail due to insufficient extent blocks. > > Please see ext4_convert_unwritten_extents_endio(). Although we have made > our best effort to tried to split extents using > EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not > covered all scenarios. Moreover, after converting the buffered I/O path > to the iomap infrastructure in the future, we may need to split extents > during the I/O completion worker[1]. > > In most block allocation processes, we already have a retry loop to deal > with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it > doesn't seem appropriate to place this logic into the I/O completion > handling process (I haven't thought this solution through deeply yet, > but I'm afraid it could introduce potential deadlock risks due to its > involvement with journal operations), and we can't just simply try again. > If we remove the ZEROOUT logic, we may lose our last line of defense > during the I/O completion. > > Currently, I am considering whether it is possible to completely remove > EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before > submitting I/Os; instead, all splitting would be performed when > converting extents to written after the I/O completes. Based on my patch, > "ext4: use reserved metadata blocks when splitting extent on endio"[2], > and the ZEROOUT logic, this approach appears feasible, and xfstest-bld > shows no regressions. > > So I think the ZEROOUT logic remains somewhat useful until we find better > solution(e.g., making more precise reservations for metadata). Perhaps we > can refactor both the split extent and ZEROOUT logic to make them more > concise. Hi Yi, Okay it makese sense to keep the zeroout if iomap path is planning to shift the extent splitting to endio. Plus, I agree based on the comments in the ext4_convert_unwritte_extents_endio() that we might even today need to split in endio (although i cant recall when this happens) which would need a zeroout fallback. And yes, I think refactoring the whole logic to be less confusing would be better. I had an older unposted untested patch cleaning up some of this, I was looking at it again today and there seems to be a lot of cleanups we can do here but that becomes out of scope of this patchset I believe. Sure then, lets keep it as it is for now. I'll review the changes you made and later I can post a patch refactoring this area. Regards, ojaswin > > [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/ > [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/ > > Cheers, > Yi. > > > Thanks, > > Ojaswin > > > >> EXT4_EXT_MAY_ZEROOUT when splitting extent. > >> - Add patches 06-07, fix two stale extent status entries problems also > >> caused by splitting extent. > >> - Modify patches 08-10, extend __es_remove_extent() and > >> ext4_es_cache_extent() to allow them to overwrite existing extents of > >> the same status when caching on-disk extents, while also checking > >> extents of different stauts and raising alarms to prevent misuse. > >> - Add patch 13 to clear the usage of ext4_es_insert_extent(), and > >> remove the TODO comment in it. > >> > >> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/ > >> > >> Original Description > >> > >> This series addresses the optimization that Jan pointed out [1] > >> regarding the introduction of a sequence number to > >> ext4_es_insert_extent(). The proposal is to replace all instances where > >> the cache of on-disk extents is updated by using ext4_es_cache_extent() > >> instead of ext4_es_insert_extent(). This change can prevent excessive > >> cache invalidations caused by unnecessarily increasing the extent > >> sequence number when reading from the on-disk extent tree. > >> > >> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/ > >> > >> Cheers, > >> Yi. > >> > >> Zhang Yi (13): > >> ext4: cleanup zeroout in ext4_split_extent_at() > >> ext4: subdivide EXT4_EXT_DATA_VALID1 > >> ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 > >> ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before > >> submitting I/O > >> ext4: correct the mapping status if the extent has been zeroed > >> ext4: don't cache extent during splitting extent > >> ext4: drop extent cache before splitting extent > >> ext4: cleanup useless out tag in __es_remove_extent() > >> ext4: make __es_remove_extent() check extent status > >> ext4: make ext4_es_cache_extent() support overwrite existing extents > >> ext4: adjust the debug info in ext4_es_cache_extent() > >> ext4: replace ext4_es_insert_extent() when caching on-disk extents > >> ext4: drop the TODO comment in ext4_es_insert_extent() > >> > >> fs/ext4/extents.c | 127 +++++++++++++++++++++++---------------- > >> fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++--------- > >> fs/ext4/inode.c | 18 +++--- > >> 3 files changed, 176 insertions(+), 90 deletions(-) > >> > >> -- > >> 2.46.1 > >> >
On 11/24/2025 8:50 PM, Ojaswin Mujoo wrote: > On Mon, Nov 24, 2025 at 01:04:04PM +0800, Zhang Yi wrote: >> Hi, Ojaswin! >> >> On 11/23/2025 6:55 PM, Ojaswin Mujoo wrote: >>> On Fri, Nov 21, 2025 at 02:07:58PM +0800, Zhang Yi wrote: >>>> Changes since v1: >>>> - Rebase the codes based on the latest linux-next 20251120. >>>> - Add patches 01-05, fix two stale data problems caused by >>> >>> Hi Zhang, thanks for the patches. >>> >> >> Thank you for take time to look at this series. >> >>> I've always felt uncomfortable with the ZEROOUT code here because it >>> seems to have many such bugs as you pointed out in the series. Its very >>> fragile and the bugs are easy to miss behind all the data valid and >>> split flags mess. >>> >> >> Yes, I agree with you. The implementation of EXT4_EXT_MAY_ZEROOUT has >> significantly increased the complexity of split extents and the >> potential for bugs. >> >>> As per my understanding, ZEROOUT logic seems to be a special best-effort >>> try to make the split/convert operation "work" when dealing with >>> transient errors like ENOSPC etc. I was just wondering if it makes sense >>> to just get rid of the whole ZEROOUT logic completely and just reset the >>> extent to orig state if there is any error. This allows us to get rid of >>> DATA_VALID* flags as well and makes the whole ext4_split_convert_extents() >>> slightly less messy. >>> >>> Maybe we can have a retry loop at the top level caller if we want to try >>> again for say ENOSPC or ENOMEM. >>> >>> Would love to hear your thoughts on it. >>> >> >> I think this is a direction worth exploring. However, what I am >> currently considering is that we need to address this scenario of >> splitting extent during the I/O completion. Although the ZEROOUT logic >> is fragile and has many issues recently, it currently serves as a >> fallback solution for handling ENOSPC errors that arise when splitting >> extents during I/O completion. It ensures that I/O operations do not >> fail due to insufficient extent blocks. >> >> Please see ext4_convert_unwritten_extents_endio(). Although we have made >> our best effort to tried to split extents using >> EXT4_GET_BLOCKS_IO_CREATE_EXT before issuing I/Os, we still have not >> covered all scenarios. Moreover, after converting the buffered I/O path >> to the iomap infrastructure in the future, we may need to split extents >> during the I/O completion worker[1]. >> >> In most block allocation processes, we already have a retry loop to deal >> with ENOSPC or ENOMEM, such as ext4_should_retry_alloc(). However, it >> doesn't seem appropriate to place this logic into the I/O completion >> handling process (I haven't thought this solution through deeply yet, >> but I'm afraid it could introduce potential deadlock risks due to its >> involvement with journal operations), and we can't just simply try again. >> If we remove the ZEROOUT logic, we may lose our last line of defense >> during the I/O completion. >> >> Currently, I am considering whether it is possible to completely remove >> EXT4_GET_BLOCKS_IO_CREATE_EXT so that extents are not split before >> submitting I/Os; instead, all splitting would be performed when >> converting extents to written after the I/O completes. Based on my patch, >> "ext4: use reserved metadata blocks when splitting extent on endio"[2], >> and the ZEROOUT logic, this approach appears feasible, and xfstest-bld >> shows no regressions. >> >> So I think the ZEROOUT logic remains somewhat useful until we find better >> solution(e.g., making more precise reservations for metadata). Perhaps we >> can refactor both the split extent and ZEROOUT logic to make them more >> concise. > > Hi Yi, > > Okay it makese sense to keep the zeroout if iomap path is planning to > shift the extent splitting to endio. Plus, I agree based on the comments > in the ext4_convert_unwritte_extents_endio() that we might even today > need to split in endio (although i cant recall when this happens) which > would need a zeroout fallback. A relatively common case is the concurrency between folio write-back and fallocate. After an unwritten extent is allocated during the writeback process, if fallocate is performed before the I/O completes, the unwritten extent created by fallocate will merge with the writeback portion. Consequently, a split operation is required once the I/O completes. > > And yes, I think refactoring the whole logic to be less confusing would > be better. I had an older unposted untested patch cleaning up some of > this, I was looking at it again today and there seems to be a lot of > cleanups we can do here but that becomes out of scope of this patchset I > believe. > > Sure then, lets keep it as it is for now. I'll review the changes you > made and later I can post a patch refactoring this area. OK, thank you a lot for your review and look forward to your patch. Thanks, Yi. > > Regards, > ojaswin > >> >> [1] https://lore.kernel.org/linux-ext4/20241022111059.2566137-18-yi.zhang@huaweicloud.com/ >> [2] https://lore.kernel.org/linux-ext4/20241022111059.2566137-12-yi.zhang@huaweicloud.com/ >> >> Cheers, >> Yi. >> >>> Thanks, >>> Ojaswin >>> >>>> EXT4_EXT_MAY_ZEROOUT when splitting extent. >>>> - Add patches 06-07, fix two stale extent status entries problems also >>>> caused by splitting extent. >>>> - Modify patches 08-10, extend __es_remove_extent() and >>>> ext4_es_cache_extent() to allow them to overwrite existing extents of >>>> the same status when caching on-disk extents, while also checking >>>> extents of different stauts and raising alarms to prevent misuse. >>>> - Add patch 13 to clear the usage of ext4_es_insert_extent(), and >>>> remove the TODO comment in it. >>>> >>>> v1: https://lore.kernel.org/linux-ext4/20251031062905.4135909-1-yi.zhang@huaweicloud.com/ >>>> >>>> Original Description >>>> >>>> This series addresses the optimization that Jan pointed out [1] >>>> regarding the introduction of a sequence number to >>>> ext4_es_insert_extent(). The proposal is to replace all instances where >>>> the cache of on-disk extents is updated by using ext4_es_cache_extent() >>>> instead of ext4_es_insert_extent(). This change can prevent excessive >>>> cache invalidations caused by unnecessarily increasing the extent >>>> sequence number when reading from the on-disk extent tree. >>>> >>>> [1] https://lore.kernel.org/linux-ext4/ympvfypw3222g2k4xzd5pba4zhkz5jihw4td67iixvrqhuu43y@wse63ntv4s6u/ >>>> >>>> Cheers, >>>> Yi. >>>> >>>> Zhang Yi (13): >>>> ext4: cleanup zeroout in ext4_split_extent_at() >>>> ext4: subdivide EXT4_EXT_DATA_VALID1 >>>> ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 >>>> ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before >>>> submitting I/O >>>> ext4: correct the mapping status if the extent has been zeroed >>>> ext4: don't cache extent during splitting extent >>>> ext4: drop extent cache before splitting extent >>>> ext4: cleanup useless out tag in __es_remove_extent() >>>> ext4: make __es_remove_extent() check extent status >>>> ext4: make ext4_es_cache_extent() support overwrite existing extents >>>> ext4: adjust the debug info in ext4_es_cache_extent() >>>> ext4: replace ext4_es_insert_extent() when caching on-disk extents >>>> ext4: drop the TODO comment in ext4_es_insert_extent() >>>> >>>> fs/ext4/extents.c | 127 +++++++++++++++++++++++---------------- >>>> fs/ext4/extents_status.c | 121 ++++++++++++++++++++++++++++--------- >>>> fs/ext4/inode.c | 18 +++--- >>>> 3 files changed, 176 insertions(+), 90 deletions(-) >>>> >>>> -- >>>> 2.46.1 >>>> >>
© 2016 - 2025 Red Hat, Inc.