[v2] ext4: replace ext4_es_insert_extent() when caching on-disk extents

[PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Zhang Yi 2 months, 2 weeks ago

From: Zhang Yi <yi.zhang@huawei.com>

When allocating initialized blocks from a large unwritten extent, or
when splitting an unwritten extent during end I/O and converting it to
initialized, there is currently a potential issue of stale data if the
extent needs to be split in the middle.

       0  A      B  N
       [UUUUUUUUUUUU]    U: unwritten extent
       [--DDDDDDDD--]    D: valid data
          |<-  ->| ----> this range needs to be initialized

ext4_split_extent() first try to split this extent at B with
EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
ext4_split_extent_at() failed to split this extent due to temporary lack
of space. It zeroout B to N and mark the entire extent from 0 to N
as written.

       0  A      B  N
       [WWWWWWWWWWWW]    W: written extent
       [SSDDDDDDDDZZ]    Z: zeroed, S: stale data

ext4_split_extent() then try to split this extent at A with
EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
a stale written extent from 0 to A.

       0  A      B   N
       [WW|WWWWWWWWWW]
       [SS|DDDDDDDDZZ]

Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
when splitting at B, don't convert the entire extent to written and left
it as unwritten after zeroing out B to N. The remaining work is just
like the standard two-part split. ext4_split_extent() will pass the
EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
second time, allowing it to properly handle the split. If the split is
successful, it will keep extent from 0 to A as unwritten.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/extents.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f7aa497e5d6c..cafe66cb562f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
 		err = ext4_ext_zeroout(inode, &zero_ex);
 		if (err)
 			goto fix_extent_len;
+		/*
+		 * The first half contains partially valid data, the splitting
+		 * of this extent has not been completed, fix extent length
+		 * and ext4_split_extent() split will the first half again.
+		 */
+		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
+			goto fix_extent_len;
 
 		/* update the extent length and mark as initialized */
 		ex->ee_len = cpu_to_le16(ee_len);
@@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
 			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
 				       EXT4_EXT_MARK_UNWRIT2;
 		if (split_flag & EXT4_EXT_DATA_VALID2)
-			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
+			split_flag1 |= map->m_lblk > ee_block ?
+				       EXT4_EXT_DATA_PARTIAL_VALID1 :
+				       EXT4_EXT_DATA_ENTIRE_VALID1;
 		path = ext4_split_extent_at(handle, inode, path,
 				map->m_lblk + map->m_len, split_flag1, flags1);
 		if (IS_ERR(path))
-- 
2.46.1

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Jan Kara 2 months, 2 weeks ago

On Fri 21-11-25 14:08:01, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When allocating initialized blocks from a large unwritten extent, or
> when splitting an unwritten extent during end I/O and converting it to
> initialized, there is currently a potential issue of stale data if the
> extent needs to be split in the middle.
> 
>        0  A      B  N
>        [UUUUUUUUUUUU]    U: unwritten extent
>        [--DDDDDDDD--]    D: valid data
>           |<-  ->| ----> this range needs to be initialized
> 
> ext4_split_extent() first try to split this extent at B with
> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> ext4_split_extent_at() failed to split this extent due to temporary lack
> of space. It zeroout B to N and mark the entire extent from 0 to N
> as written.
> 
>        0  A      B  N
>        [WWWWWWWWWWWW]    W: written extent
>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
> 
> ext4_split_extent() then try to split this extent at A with
> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
> a stale written extent from 0 to A.
> 
>        0  A      B   N
>        [WW|WWWWWWWWWW]
>        [SS|DDDDDDDDZZ]
> 
> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
> when splitting at B, don't convert the entire extent to written and left
> it as unwritten after zeroing out B to N. The remaining work is just
> like the standard two-part split. ext4_split_extent() will pass the
> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
> second time, allowing it to properly handle the split. If the split is
> successful, it will keep extent from 0 to A as unwritten.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Good catch on the data exposure issue! First I'd like to discuss whether
there isn't a way to fix these problems in a way that doesn't make the
already complex code even more complex. My observation is that
EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
in ext4_split_convert_extents() which both call ext4_split_extent(). The
actual extent zeroing happens in ext4_split_extent_at() and in
ext4_ext_convert_to_initialized(). I think the code would be much clearer
if we just centralized all the zeroing in ext4_split_extent(). At that
place the situation is actually pretty simple:

1) 'ex' is unwritten, 'map' describes part with already written data which
we want to convert to initialized (generally IO completion situation) => we
can zero out boundaries if they are smaller than max_zeroout or if extent
split fails.

2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
submission) => the split is opportunistic here, if we cannot split due to
ENOSPC, just go on and deal with it at IO completion time. No zeroing
needed.

3) 'ex' is written, 'map' describes part that should be converted to
unwritten => we can zero out the 'map' part if smaller than max_zeroout or
if extent split fails.

This should all result in a relatively straightforward code where we can
distinguish the three cases based on 'ex' and passed flags, we should be
able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
the data exposure issues at the same time. What do you think? Am I missing
some case?

								Honza

> ---
>  fs/ext4/extents.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index f7aa497e5d6c..cafe66cb562f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  		err = ext4_ext_zeroout(inode, &zero_ex);
>  		if (err)
>  			goto fix_extent_len;
> +		/*
> +		 * The first half contains partially valid data, the splitting
> +		 * of this extent has not been completed, fix extent length
> +		 * and ext4_split_extent() split will the first half again.
> +		 */
> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
> +			goto fix_extent_len;
>  
>  		/* update the extent length and mark as initialized */
>  		ex->ee_len = cpu_to_le16(ee_len);
> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>  				       EXT4_EXT_MARK_UNWRIT2;
>  		if (split_flag & EXT4_EXT_DATA_VALID2)
> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> +			split_flag1 |= map->m_lblk > ee_block ?
> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>  		path = ext4_split_extent_at(handle, inode, path,
>  				map->m_lblk + map->m_len, split_flag1, flags1);
>  		if (IS_ERR(path))
> -- 
> 2.46.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Ojaswin Mujoo 2 months, 1 week ago

On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> On Fri 21-11-25 14:08:01, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > When allocating initialized blocks from a large unwritten extent, or
> > when splitting an unwritten extent during end I/O and converting it to
> > initialized, there is currently a potential issue of stale data if the
> > extent needs to be split in the middle.
> > 
> >        0  A      B  N
> >        [UUUUUUUUUUUU]    U: unwritten extent
> >        [--DDDDDDDD--]    D: valid data
> >           |<-  ->| ----> this range needs to be initialized
> > 
> > ext4_split_extent() first try to split this extent at B with
> > EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> > ext4_split_extent_at() failed to split this extent due to temporary lack
> > of space. It zeroout B to N and mark the entire extent from 0 to N
> > as written.
> > 
> >        0  A      B  N
> >        [WWWWWWWWWWWW]    W: written extent
> >        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
> > 
> > ext4_split_extent() then try to split this extent at A with
> > EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
> > a stale written extent from 0 to A.
> > 
> >        0  A      B   N
> >        [WW|WWWWWWWWWW]
> >        [SS|DDDDDDDDZZ]
> > 
> > Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
> > when splitting at B, don't convert the entire extent to written and left
> > it as unwritten after zeroing out B to N. The remaining work is just
> > like the standard two-part split. ext4_split_extent() will pass the
> > EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
> > second time, allowing it to properly handle the split. If the split is
> > successful, it will keep extent from 0 to A as unwritten.
> > 
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Good catch on the data exposure issue! First I'd like to discuss whether
> there isn't a way to fix these problems in a way that doesn't make the
> already complex code even more complex. My observation is that
> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> actual extent zeroing happens in ext4_split_extent_at() and in
> ext4_ext_convert_to_initialized(). I think the code would be much clearer
> if we just centralized all the zeroing in ext4_split_extent(). At that
> place the situation is actually pretty simple:

This is exactly what I was playing with in my local tree to refactor this
particular part of code :). I agree that ext4_split_extent() is a much
better place to do the zeroout and it looks much cleaner but I agree
with Yi that it might be better to do it after fixing the stale
exposures so backports are straight forward. 

Am I correct in understanding that you are suggesting to zeroout
proactively if we are below max_zeroout before even trying to extent
split (which seems be done in ext4_ext_convert_to_initialized() as well)?

In this case, I have 2 concerns:

> 
> 1) 'ex' is unwritten, 'map' describes part with already written data which
> we want to convert to initialized (generally IO completion situation) => we
> can zero out boundaries if they are smaller than max_zeroout or if extent
> split fails.

Firstly, I know you mentioned in another email that zeroout of small ranges
gives us a performance win but is it really faster on average than
extent manipulation?

For example, for case 1 where both zeroout and splitting need
journalling, I understand that splitting has high journal overhead in worst case,
where tree might grow, but more often than not we would be manipulating
within the same leaf so journalling only 1 bh (same as zeroout). In which case
seems like zeroout might be slower no matter how fast the IO can be
done. So proactive zeroout might be for beneficial for case 3 than case
1.

> 
> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> submission) => the split is opportunistic here, if we cannot split due to
> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> needed.
> 
> 3) 'ex' is written, 'map' describes part that should be converted to
> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> if extent split fails.

Proactive zeroout before trying split does seem benficial to help us
avoid journal overhead for split. However, judging from
ext4_ext_convert_to_initialized(), max zeroout comes from
sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
the IO device, so that means theres a chance a zeroout might be pretty
slow if say we are doing it on a device than doesn't support accelerated
zeroout operations. Maybe we need to be more intelligent in setting
s_extent_max_zeroout_kb?

> 
> This should all result in a relatively straightforward code where we can
> distinguish the three cases based on 'ex' and passed flags, we should be
> able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> the data exposure issues at the same time. What do you think? Am I missing
> some case?
> 
> 								Honza
> 
> > ---
> >  fs/ext4/extents.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index f7aa497e5d6c..cafe66cb562f 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
> >  		err = ext4_ext_zeroout(inode, &zero_ex);
> >  		if (err)
> >  			goto fix_extent_len;
> > +		/*
> > +		 * The first half contains partially valid data, the splitting
> > +		 * of this extent has not been completed, fix extent length
> > +		 * and ext4_split_extent() split will the first half again.
> > +		 */
> > +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
> > +			goto fix_extent_len;
> >  
> >  		/* update the extent length and mark as initialized */
> >  		ex->ee_len = cpu_to_le16(ee_len);
> > @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
> >  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
> >  				       EXT4_EXT_MARK_UNWRIT2;
> >  		if (split_flag & EXT4_EXT_DATA_VALID2)
> > -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> > +			split_flag1 |= map->m_lblk > ee_block ?
> > +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
> > +				       EXT4_EXT_DATA_ENTIRE_VALID1;
> >  		path = ext4_split_extent_at(handle, inode, path,
> >  				map->m_lblk + map->m_len, split_flag1, flags1);
> >  		if (IS_ERR(path))
> > -- 
> > 2.46.1
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Jan Kara 2 months, 1 week ago

On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> > Good catch on the data exposure issue! First I'd like to discuss whether
> > there isn't a way to fix these problems in a way that doesn't make the
> > already complex code even more complex. My observation is that
> > EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> > in ext4_split_convert_extents() which both call ext4_split_extent(). The
> > actual extent zeroing happens in ext4_split_extent_at() and in
> > ext4_ext_convert_to_initialized(). I think the code would be much clearer
> > if we just centralized all the zeroing in ext4_split_extent(). At that
> > place the situation is actually pretty simple:
> 
> This is exactly what I was playing with in my local tree to refactor this
> particular part of code :). I agree that ext4_split_extent() is a much
> better place to do the zeroout and it looks much cleaner but I agree
> with Yi that it might be better to do it after fixing the stale
> exposures so backports are straight forward. 
> 
> Am I correct in understanding that you are suggesting to zeroout
> proactively if we are below max_zeroout before even trying to extent
> split (which seems be done in ext4_ext_convert_to_initialized() as well)?

Yes. I was suggesting to effectively keep the behavior from
ext4_ext_convert_to_initialized().

> In this case, I have 2 concerns:
> 
> > 
> > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > we want to convert to initialized (generally IO completion situation) => we
> > can zero out boundaries if they are smaller than max_zeroout or if extent
> > split fails.
> 
> Firstly, I know you mentioned in another email that zeroout of small ranges
> gives us a performance win but is it really faster on average than
> extent manipulation?

I guess it depends on the storage and the details of the extent tree. But
it definitely does help in cases like when you have large unwritten extent
and then start writing randomly 4k blocks into it because this zeroout
logic effectively limits the fragmentation of the extent tree. Overall
sequentially writing a few blocks more of zeros is very cheap practically
with any storage while fragmenting the extent tree becomes expensive rather
quickly (you generally get deeper extent tree due to smaller extents etc.).

> For example, for case 1 where both zeroout and splitting need
> journalling, I understand that splitting has high journal overhead in worst case,
> where tree might grow, but more often than not we would be manipulating
> within the same leaf so journalling only 1 bh (same as zeroout). In which case
> seems like zeroout might be slower no matter how fast the IO can be
> done. So proactive zeroout might be for beneficial for case 3 than case
> 1.

I agree that initially while the split extents still fit into the same leaf
block, zero out is likely to be somewhat slower but over the longer term
the gains from less extent fragmentation win.

> > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > submission) => the split is opportunistic here, if we cannot split due to
> > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > needed.
> > 
> > 3) 'ex' is written, 'map' describes part that should be converted to
> > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > if extent split fails.
> 
> Proactive zeroout before trying split does seem benficial to help us
> avoid journal overhead for split. However, judging from
> ext4_ext_convert_to_initialized(), max zeroout comes from
> sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> the IO device, so that means theres a chance a zeroout might be pretty
> slow if say we are doing it on a device than doesn't support accelerated
> zeroout operations. Maybe we need to be more intelligent in setting
> s_extent_max_zeroout_kb?

You can also tune the value in sysfs. I'm not 100% sure how the kernel
could do a better guess. Also I think 32k works mostly because it is small
enough to be cheap to write but already large enough to noticeably reduce
fragmentation for some pathological workloads (you can easily get 1/4 of
the extents than without this logic). But I'm open to ideas if you have
some.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Andreas Dilger 2 months, 1 week ago


> On Nov 28, 2025, at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
>> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
>>> Good catch on the data exposure issue! First I'd like to discuss whether
>>> there isn't a way to fix these problems in a way that doesn't make the
>>> already complex code even more complex. My observation is that
>>> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
>>> in ext4_split_convert_extents() which both call ext4_split_extent(). The
>>> actual extent zeroing happens in ext4_split_extent_at() and in
>>> ext4_ext_convert_to_initialized(). I think the code would be much clearer
>>> if we just centralized all the zeroing in ext4_split_extent(). At that
>>> place the situation is actually pretty simple:
>> 
>> This is exactly what I was playing with in my local tree to refactor this
>> particular part of code :). I agree that ext4_split_extent() is a much
>> better place to do the zeroout and it looks much cleaner but I agree
>> with Yi that it might be better to do it after fixing the stale
>> exposures so backports are straight forward. 
>> 
>> Am I correct in understanding that you are suggesting to zeroout
>> proactively if we are below max_zeroout before even trying to extent
>> split (which seems be done in ext4_ext_convert_to_initialized() as well)?
> 
> Yes. I was suggesting to effectively keep the behavior from
> ext4_ext_convert_to_initialized().
> 
>> In this case, I have 2 concerns:
>> 
>>> 
>>> 1) 'ex' is unwritten, 'map' describes part with already written data which
>>> we want to convert to initialized (generally IO completion situation) => we
>>> can zero out boundaries if they are smaller than max_zeroout or if extent
>>> split fails.
>> 
>> Firstly, I know you mentioned in another email that zeroout of small ranges
>> gives us a performance win but is it really faster on average than
>> extent manipulation?
> 
> I guess it depends on the storage and the details of the extent tree. But
> it definitely does help in cases like when you have large unwritten extent
> and then start writing randomly 4k blocks into it because this zeroout
> logic effectively limits the fragmentation of the extent tree. Overall
> sequentially writing a few blocks more of zeros is very cheap practically
> with any storage while fragmenting the extent tree becomes expensive rather
> quickly (you generally get deeper extent tree due to smaller extents etc.).

The zeroout logic is not primarily an issue with the extent tree complexity.
I agree with Ojaswin that in the common case the extent split would not
cause a new index block to be written, though it can become unwieldy in the
extreme case.

As Jan wrote, the main performance win is to avoid writing a bunch of
small discontiguous blocks.  For HDD *and* flash, the overhead of writing
several separate small blocks is much higher than writing a single 32KiB
or 64KiB block to the storage.  Multiple separate blocks means more items
in the queue and submitted to storage, separate seeks on an HDD and/or read-
modify-write on a RAID controller, or erase blocks on a flash device.

It also defers the conversion of those unwritten extents to a later time,
when they would need to be processed again anyway if the blocks were written.

I was also considering whether the unwritten blocks would save on reads,
but I suspect that would not be the case either.  Doing sequential reads
would need to submit multiple small reads to the device and then zeroout
the unwritten blocks instead of a single 32KiB or 64KiB read (which is
basically free once the request is processed.

> 
>> For example, for case 1 where both zeroout and splitting need
>> journalling, I understand that splitting has high journal overhead in
>> worst case, where tree might grow, but more often than not we would be
>> manipulating within the same leaf so journalling only 1 bh (same as
>> zeroout). In which case seems like zeroout might be slower no matter
>> how fast the IO can be done. So proactive zeroout might be for beneficial
>> for case 3 than case 1.
> 
> I agree that initially while the split extents still fit into the same leaf
> block, zero out is likely to be somewhat slower but over the longer term
> the gains from less extent fragmentation win.

I doubt that writing a single contiguous 32KiB chunk is ever going to be
slower than writing 2 or 3 separate 4KiB chunks to the storage.  _Maybe_
if it was NVRAM, but I don't think that would be the common case?

>>> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
>>> submission) => the split is opportunistic here, if we cannot split due to
>>> ENOSPC, just go on and deal with it at IO completion time. No zeroing
>>> needed.
>>> 
>>> 3) 'ex' is written, 'map' describes part that should be converted to
>>> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
>>> if extent split fails.
>> 
>> Proactive zeroout before trying split does seem benficial to help us
>> avoid journal overhead for split. However, judging from
>> ext4_ext_convert_to_initialized(), max zeroout comes from
>> sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
>> the IO device, so that means theres a chance a zeroout might be pretty
>> slow if say we are doing it on a device than doesn't support accelerated
>> zeroout operations. Maybe we need to be more intelligent in setting
>> s_extent_max_zeroout_kb?
> 
> You can also tune the value in sysfs. I'm not 100% sure how the kernel
> could do a better guess. Also I think 32k works mostly because it is small
> enough to be cheap to write but already large enough to noticeably reduce
> fragmentation for some pathological workloads (you can easily get 1/4 of
> the extents than without this logic). But I'm open to ideas if you have
> some.

Aligning this size with the flash erase block size might be a win?
It may be that 32KiB is still large enough today (I've heard of 16KiB
sector flash devices arriving soon, and IIRC 64KiB sectors are the
norm for HDDs if anyone still cares).  Having this tuned automatically
by the physical device characteristics (like max(32KiB, sector size) or
similar if the flash erase block size is available somehow in the kernel)
would future proof this as device sizes continue to grow.


Cheers, Andreas

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Ojaswin Mujoo 2 months, 1 week ago

On Fri, Nov 28, 2025 at 12:52:30PM -0700, Andreas Dilger wrote:
> 
> 
> > On Nov 28, 2025, at 4:14 AM, Jan Kara <jack@suse.cz> wrote:
> > 
> > On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> >> On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> >>> Good catch on the data exposure issue! First I'd like to discuss whether
> >>> there isn't a way to fix these problems in a way that doesn't make the
> >>> already complex code even more complex. My observation is that
> >>> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> >>> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> >>> actual extent zeroing happens in ext4_split_extent_at() and in
> >>> ext4_ext_convert_to_initialized(). I think the code would be much clearer
> >>> if we just centralized all the zeroing in ext4_split_extent(). At that
> >>> place the situation is actually pretty simple:
> >> 
> >> This is exactly what I was playing with in my local tree to refactor this
> >> particular part of code :). I agree that ext4_split_extent() is a much
> >> better place to do the zeroout and it looks much cleaner but I agree
> >> with Yi that it might be better to do it after fixing the stale
> >> exposures so backports are straight forward. 
> >> 
> >> Am I correct in understanding that you are suggesting to zeroout
> >> proactively if we are below max_zeroout before even trying to extent
> >> split (which seems be done in ext4_ext_convert_to_initialized() as well)?
> > 
> > Yes. I was suggesting to effectively keep the behavior from
> > ext4_ext_convert_to_initialized().
> > 
> >> In this case, I have 2 concerns:
> >> 
> >>> 
> >>> 1) 'ex' is unwritten, 'map' describes part with already written data which
> >>> we want to convert to initialized (generally IO completion situation) => we
> >>> can zero out boundaries if they are smaller than max_zeroout or if extent
> >>> split fails.
> >> 
> >> Firstly, I know you mentioned in another email that zeroout of small ranges
> >> gives us a performance win but is it really faster on average than
> >> extent manipulation?
> > 
> > I guess it depends on the storage and the details of the extent tree. But
> > it definitely does help in cases like when you have large unwritten extent
> > and then start writing randomly 4k blocks into it because this zeroout
> > logic effectively limits the fragmentation of the extent tree. Overall
> > sequentially writing a few blocks more of zeros is very cheap practically
> > with any storage while fragmenting the extent tree becomes expensive rather
> > quickly (you generally get deeper extent tree due to smaller extents etc.).
> 
> The zeroout logic is not primarily an issue with the extent tree complexity.
> I agree with Ojaswin that in the common case the extent split would not
> cause a new index block to be written, though it can become unwieldy in the
> extreme case.
> 
> As Jan wrote, the main performance win is to avoid writing a bunch of
> small discontiguous blocks.  For HDD *and* flash, the overhead of writing
> several separate small blocks is much higher than writing a single 32KiB
> or 64KiB block to the storage.  Multiple separate blocks means more items
> in the queue and submitted to storage, separate seeks on an HDD and/or read-
> modify-write on a RAID controller, or erase blocks on a flash device.

Okay makes sense, so basically zeroout helps in the long run by reducing
fragmentation.

> 
> It also defers the conversion of those unwritten extents to a later time,
> when they would need to be processed again anyway if the blocks were written.
> 
> I was also considering whether the unwritten blocks would save on reads,
> but I suspect that would not be the case either.  Doing sequential reads
> would need to submit multiple small reads to the device and then zeroout
> the unwritten blocks instead of a single 32KiB or 64KiB read (which is
> basically free once the request is processed.
> 
> > 
> >> For example, for case 1 where both zeroout and splitting need
> >> journalling, I understand that splitting has high journal overhead in
> >> worst case, where tree might grow, but more often than not we would be
> >> manipulating within the same leaf so journalling only 1 bh (same as
> >> zeroout). In which case seems like zeroout might be slower no matter
> >> how fast the IO can be done. So proactive zeroout might be for beneficial
> >> for case 3 than case 1.
> > 
> > I agree that initially while the split extents still fit into the same leaf
> > block, zero out is likely to be somewhat slower but over the longer term
> > the gains from less extent fragmentation win.
> 
> I doubt that writing a single contiguous 32KiB chunk is ever going to be
> slower than writing 2 or 3 separate 4KiB chunks to the storage.  _Maybe_
> if it was NVRAM, but I don't think that would be the common case?
> 
> >>> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> >>> submission) => the split is opportunistic here, if we cannot split due to
> >>> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> >>> needed.
> >>> 
> >>> 3) 'ex' is written, 'map' describes part that should be converted to
> >>> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> >>> if extent split fails.
> >> 
> >> Proactive zeroout before trying split does seem benficial to help us
> >> avoid journal overhead for split. However, judging from
> >> ext4_ext_convert_to_initialized(), max zeroout comes from
> >> sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> >> the IO device, so that means theres a chance a zeroout might be pretty
> >> slow if say we are doing it on a device than doesn't support accelerated
> >> zeroout operations. Maybe we need to be more intelligent in setting
> >> s_extent_max_zeroout_kb?
> > 
> > You can also tune the value in sysfs. I'm not 100% sure how the kernel
> > could do a better guess. Also I think 32k works mostly because it is small
> > enough to be cheap to write but already large enough to noticeably reduce
> > fragmentation for some pathological workloads (you can easily get 1/4 of
> > the extents than without this logic). But I'm open to ideas if you have
> > some.
> 
> Aligning this size with the flash erase block size might be a win?
> It may be that 32KiB is still large enough today (I've heard of 16KiB
> sector flash devices arriving soon, and IIRC 64KiB sectors are the
> norm for HDDs if anyone still cares).  Having this tuned automatically
> by the physical device characteristics (like max(32KiB, sector size) or
> similar if the flash erase block size is available somehow in the kernel)
> would future proof this as device sizes continue to grow.

okay so do you think it makes sense to consider the
max_write_zeroes_sectors() value of the underlying device. Enterprise
NVMEs and SCSI disks can utilize WRITE_ZEROES or WRITE_SAME which can
allow us to potentially zeroout much larger ranges with minimal overhead
of data movement. Maybe we can leverage these to zeroout more
aggressively without much of a performance penalty.

Let me try to see if i can find a disk and try a POC for this.

Regards,
ojaswin

> 
> 
> Cheers, Andreas
> 
> 
> 
> 
>

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Ojaswin Mujoo 2 months, 1 week ago

On Fri, Nov 28, 2025 at 12:14:56PM +0100, Jan Kara wrote:
> On Fri 28-11-25 12:58:22, Ojaswin Mujoo wrote:
> > On Thu, Nov 27, 2025 at 02:41:52PM +0100, Jan Kara wrote:
> > > Good catch on the data exposure issue! First I'd like to discuss whether
> > > there isn't a way to fix these problems in a way that doesn't make the
> > > already complex code even more complex. My observation is that
> > > EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> > > in ext4_split_convert_extents() which both call ext4_split_extent(). The
> > > actual extent zeroing happens in ext4_split_extent_at() and in
> > > ext4_ext_convert_to_initialized(). I think the code would be much clearer
> > > if we just centralized all the zeroing in ext4_split_extent(). At that
> > > place the situation is actually pretty simple:
> > 
> > This is exactly what I was playing with in my local tree to refactor this
> > particular part of code :). I agree that ext4_split_extent() is a much
> > better place to do the zeroout and it looks much cleaner but I agree
> > with Yi that it might be better to do it after fixing the stale
> > exposures so backports are straight forward. 
> > 
> > Am I correct in understanding that you are suggesting to zeroout
> > proactively if we are below max_zeroout before even trying to extent
> > split (which seems be done in ext4_ext_convert_to_initialized() as well)?
> 
> Yes. I was suggesting to effectively keep the behavior from
> ext4_ext_convert_to_initialized().
> 
> > In this case, I have 2 concerns:
> > 
> > > 
> > > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > > we want to convert to initialized (generally IO completion situation) => we
> > > can zero out boundaries if they are smaller than max_zeroout or if extent
> > > split fails.
> > 
> > Firstly, I know you mentioned in another email that zeroout of small ranges
> > gives us a performance win but is it really faster on average than
> > extent manipulation?
> 
> I guess it depends on the storage and the details of the extent tree. But
> it definitely does help in cases like when you have large unwritten extent
> and then start writing randomly 4k blocks into it because this zeroout
> logic effectively limits the fragmentation of the extent tree. Overall
> sequentially writing a few blocks more of zeros is very cheap practically
> with any storage while fragmenting the extent tree becomes expensive rather
> quickly (you generally get deeper extent tree due to smaller extents etc.).

Got it, makes sense. This approach is definitely worth a try and then
maybe we can run a few benchmarks and see how proactive zeroout works.

> 
> > For example, for case 1 where both zeroout and splitting need
> > journalling, I understand that splitting has high journal overhead in worst case,
> > where tree might grow, but more often than not we would be manipulating
> > within the same leaf so journalling only 1 bh (same as zeroout). In which case
> > seems like zeroout might be slower no matter how fast the IO can be
> > done. So proactive zeroout might be for beneficial for case 3 than case
> > 1.
> 
> I agree that initially while the split extents still fit into the same leaf
> block, zero out is likely to be somewhat slower but over the longer term
> the gains from less extent fragmentation win.
> 
> > > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > > submission) => the split is opportunistic here, if we cannot split due to
> > > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > > needed.
> > > 
> > > 3) 'ex' is written, 'map' describes part that should be converted to
> > > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > > if extent split fails.
> > 
> > Proactive zeroout before trying split does seem benficial to help us
> > avoid journal overhead for split. However, judging from
> > ext4_ext_convert_to_initialized(), max zeroout comes from
> > sbi->s_extent_max_zeroout_kb which is hardcoded to 32 irrespective of
> > the IO device, so that means theres a chance a zeroout might be pretty
> > slow if say we are doing it on a device than doesn't support accelerated
> > zeroout operations. Maybe we need to be more intelligent in setting
> > s_extent_max_zeroout_kb?
> 
> You can also tune the value in sysfs. I'm not 100% sure how the kernel

Yeah but I feel the average users dont usually tune sysfs values

> could do a better guess. Also I think 32k works mostly because it is small
> enough to be cheap to write but already large enough to noticeably reduce
> fragmentation for some pathological workloads (you can easily get 1/4 of
> the extents than without this logic). But I'm open to ideas if you have
> some.

Yes, Im yet to check in depth about it but the block layer does advertise a
max_write_zeroes_sectors() which might help us tune the value a better,
and perhaps even support higher zeroout without losing performance. I'll
look into it a bit more.

Regards,
ojaswin

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Zhang Yi 2 months, 1 week ago

On 11/27/2025 9:41 PM, Jan Kara wrote:
> On Fri 21-11-25 14:08:01, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When allocating initialized blocks from a large unwritten extent, or
>> when splitting an unwritten extent during end I/O and converting it to
>> initialized, there is currently a potential issue of stale data if the
>> extent needs to be split in the middle.
>>
>>        0  A      B  N
>>        [UUUUUUUUUUUU]    U: unwritten extent
>>        [--DDDDDDDD--]    D: valid data
>>           |<-  ->| ----> this range needs to be initialized
>>
>> ext4_split_extent() first try to split this extent at B with
>> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
>> ext4_split_extent_at() failed to split this extent due to temporary lack
>> of space. It zeroout B to N and mark the entire extent from 0 to N
>> as written.
>>
>>        0  A      B  N
>>        [WWWWWWWWWWWW]    W: written extent
>>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
>>
>> ext4_split_extent() then try to split this extent at A with
>> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
>> a stale written extent from 0 to A.
>>
>>        0  A      B   N
>>        [WW|WWWWWWWWWW]
>>        [SS|DDDDDDDDZZ]
>>
>> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
>> when splitting at B, don't convert the entire extent to written and left
>> it as unwritten after zeroing out B to N. The remaining work is just
>> like the standard two-part split. ext4_split_extent() will pass the
>> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
>> second time, allowing it to properly handle the split. If the split is
>> successful, it will keep extent from 0 to A as unwritten.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Good catch on the data exposure issue! First I'd like to discuss whether
> there isn't a way to fix these problems in a way that doesn't make the
> already complex code even more complex. My observation is that
> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> actual extent zeroing happens in ext4_split_extent_at() and in
> ext4_ext_convert_to_initialized().

Yes.

> I think the code would be much clearer
> if we just centralized all the zeroing in ext4_split_extent(). At that
> place the situation is actually pretty simple:

Thank you for your suggestion!

> 
> 1) 'ex' is unwritten, 'map' describes part with already written data which
> we want to convert to initialized (generally IO completion situation) => we
> can zero out boundaries if they are smaller than max_zeroout or if extent
> split fails.
> 

Yes. Agree.

> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> submission) => the split is opportunistic here, if we cannot split due to
> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> needed.

Yes. At the same time, if we can indeed move the entire split unwritten
operation to be handled after I/O completion in the future, it would also be
more convenient to remove this segment of logic.

> 
> 3) 'ex' is written, 'map' describes part that should be converted to
> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> if extent split fails.

This makes sense to me! This case it originates from the fallocate with Zero
Range operation. Currently, the zero-out operation will not be performed if
the split operation fails, instead, it immediately returns a failure.

I agree with you that we can do zero out if the 'map' part smaller than
max_zeroout instead of split extents. However, if the 'map' part is bigger
than max_zeroout and if extent split fails, I don't think zero out is a good
idea, Because it might cause zero-range calls to take a long time to execute.
Although fallocate doesn't explicitly specify how ZERO_RANGE should be
implemented, users expect it to be very fast. Therefore, in this case, if the
split fails, it would be better to simply return an error, leave things as
they are. What do you think?

> 
> This should all result in a relatively straightforward code where we can
> distinguish the three cases based on 'ex' and passed flags, we should be
> able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> the data exposure issues at the same time. What do you think? Am I missing
> some case?
> 

Indeed, I think the overall solution is a nice cleanup idea. :-)
But this would involve a significant amount of refactoring and logical changes.
Could we first merge the current set of patches(it could be more easier to
backport to the early LTS version), and then I can start a new series to
address this optimization?

Cheers,
Yi.

> 								Honza
> 
>> ---
>>  fs/ext4/extents.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index f7aa497e5d6c..cafe66cb562f 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>  		err = ext4_ext_zeroout(inode, &zero_ex);
>>  		if (err)
>>  			goto fix_extent_len;
>> +		/*
>> +		 * The first half contains partially valid data, the splitting
>> +		 * of this extent has not been completed, fix extent length
>> +		 * and ext4_split_extent() split will the first half again.
>> +		 */
>> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
>> +			goto fix_extent_len;
>>  
>>  		/* update the extent length and mark as initialized */
>>  		ex->ee_len = cpu_to_le16(ee_len);
>> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>>  				       EXT4_EXT_MARK_UNWRIT2;
>>  		if (split_flag & EXT4_EXT_DATA_VALID2)
>> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
>> +			split_flag1 |= map->m_lblk > ee_block ?
>> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
>> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>>  		path = ext4_split_extent_at(handle, inode, path,
>>  				map->m_lblk + map->m_len, split_flag1, flags1);
>>  		if (IS_ERR(path))
>> -- 
>> 2.46.1
>>

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Jan Kara 2 months, 1 week ago

On Fri 28-11-25 11:45:51, Zhang Yi wrote:
> On 11/27/2025 9:41 PM, Jan Kara wrote:
> > I think the code would be much clearer
> > if we just centralized all the zeroing in ext4_split_extent(). At that
> > place the situation is actually pretty simple:
> 
> Thank you for your suggestion!
> 
> > 
> > 1) 'ex' is unwritten, 'map' describes part with already written data which
> > we want to convert to initialized (generally IO completion situation) => we
> > can zero out boundaries if they are smaller than max_zeroout or if extent
> > split fails.
> > 
> 
> Yes. Agree.
> 
> > 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> > submission) => the split is opportunistic here, if we cannot split due to
> > ENOSPC, just go on and deal with it at IO completion time. No zeroing
> > needed.
> 
> Yes. At the same time, if we can indeed move the entire split unwritten
> operation to be handled after I/O completion in the future, it would also be
> more convenient to remove this segment of logic.

Yes.

> > 3) 'ex' is written, 'map' describes part that should be converted to
> > unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> > if extent split fails.
> 
> This makes sense to me! This case it originates from the fallocate with Zero
> Range operation. Currently, the zero-out operation will not be performed if
> the split operation fails, instead, it immediately returns a failure.
> 
> I agree with you that we can do zero out if the 'map' part smaller than
> max_zeroout instead of split extents. However, if the 'map' part is bigger
> than max_zeroout and if extent split fails, I don't think zero out is a good
> idea, Because it might cause zero-range calls to take a long time to execute.
> Although fallocate doesn't explicitly specify how ZERO_RANGE should be
> implemented, users expect it to be very fast. Therefore, in this case, if the
> split fails, it would be better to simply return an error, leave things as
> they are. What do you think?

True. Just returning the error is a good option in this case.

> > This should all result in a relatively straightforward code where we can
> > distinguish the three cases based on 'ex' and passed flags, we should be
> > able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> > drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> > the data exposure issues at the same time. What do you think? Am I missing
> > some case?
> 
> Indeed, I think the overall solution is a nice cleanup idea. :-)
> But this would involve a significant amount of refactoring and logical changes.
> Could we first merge the current set of patches(it could be more easier to
> backport to the early LTS version), and then I can start a new series to
> address this optimization?

I agree the changes are rather intrusive and the code is complex. Since you
have direct fixes already written, let's merge them first and cleanup
afterwards as you suggest.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Ojaswin Mujoo 2 months, 2 weeks ago

On Fri, Nov 21, 2025 at 02:08:01PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When allocating initialized blocks from a large unwritten extent, or
> when splitting an unwritten extent during end I/O and converting it to
> initialized, there is currently a potential issue of stale data if the
> extent needs to be split in the middle.
> 
>        0  A      B  N
>        [UUUUUUUUUUUU]    U: unwritten extent
>        [--DDDDDDDD--]    D: valid data
>           |<-  ->| ----> this range needs to be initialized
> 
> ext4_split_extent() first try to split this extent at B with
> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
> ext4_split_extent_at() failed to split this extent due to temporary lack
> of space. It zeroout B to N and mark the entire extent from 0 to N
> as written.
> 
>        0  A      B  N
>        [WWWWWWWWWWWW]    W: written extent
>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
> 
> ext4_split_extent() then try to split this extent at A with
> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
> a stale written extent from 0 to A.
> 
>        0  A      B   N
>        [WW|WWWWWWWWWW]
>        [SS|DDDDDDDDZZ]
> 
> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
> when splitting at B, don't convert the entire extent to written and left
> it as unwritten after zeroing out B to N. The remaining work is just
> like the standard two-part split. ext4_split_extent() will pass the
> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
> second time, allowing it to properly handle the split. If the split is
> successful, it will keep extent from 0 to A as unwritten.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Hi Yi,

This patch looks good to me. I'm just wondering since this is a stale
data exposure that might need a backport, should we add a Fixes: tag 
and also keep these fixes before the refactor in 1/13 so backport is
easier.

Other than that, feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin

> ---
>  fs/ext4/extents.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index f7aa497e5d6c..cafe66cb562f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>  		err = ext4_ext_zeroout(inode, &zero_ex);
>  		if (err)
>  			goto fix_extent_len;
> +		/*
> +		 * The first half contains partially valid data, the splitting
> +		 * of this extent has not been completed, fix extent length
> +		 * and ext4_split_extent() split will the first half again.
> +		 */
> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
> +			goto fix_extent_len;
>  
>  		/* update the extent length and mark as initialized */
>  		ex->ee_len = cpu_to_le16(ee_len);
> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>  				       EXT4_EXT_MARK_UNWRIT2;
>  		if (split_flag & EXT4_EXT_DATA_VALID2)
> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
> +			split_flag1 |= map->m_lblk > ee_block ?
> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>  		path = ext4_split_extent_at(handle, inode, path,
>  				map->m_lblk + map->m_len, split_flag1, flags1);
>  		if (IS_ERR(path))
> -- 
> 2.46.1
>

Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1

Posted by Zhang Yi 2 months, 2 weeks ago

On 11/26/2025 7:29 PM, Ojaswin Mujoo wrote:
> On Fri, Nov 21, 2025 at 02:08:01PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When allocating initialized blocks from a large unwritten extent, or
>> when splitting an unwritten extent during end I/O and converting it to
>> initialized, there is currently a potential issue of stale data if the
>> extent needs to be split in the middle.
>>
>>        0  A      B  N
>>        [UUUUUUUUUUUU]    U: unwritten extent
>>        [--DDDDDDDD--]    D: valid data
>>           |<-  ->| ----> this range needs to be initialized
>>
>> ext4_split_extent() first try to split this extent at B with
>> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
>> ext4_split_extent_at() failed to split this extent due to temporary lack
>> of space. It zeroout B to N and mark the entire extent from 0 to N
>> as written.
>>
>>        0  A      B  N
>>        [WWWWWWWWWWWW]    W: written extent
>>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
>>
>> ext4_split_extent() then try to split this extent at A with
>> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
>> a stale written extent from 0 to A.
>>
>>        0  A      B   N
>>        [WW|WWWWWWWWWW]
>>        [SS|DDDDDDDDZZ]
>>
>> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
>> when splitting at B, don't convert the entire extent to written and left
>> it as unwritten after zeroing out B to N. The remaining work is just
>> like the standard two-part split. ext4_split_extent() will pass the
>> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
>> second time, allowing it to properly handle the split. If the split is
>> successful, it will keep extent from 0 to A as unwritten.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Hi Yi,
> 
> This patch looks good to me. I'm just wondering since this is a stale
> data exposure that might need a backport, should we add a Fixes: tag 
> and also keep these fixes before the refactor in 1/13 so backport is
> easier.

Sure, I can move patch 01 after all the fix patches.

Thanks,
Yi.

> 
> Other than that, feel free to add:
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Regards,
> ojaswin
> 
>> ---
>>  fs/ext4/extents.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index f7aa497e5d6c..cafe66cb562f 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>  		err = ext4_ext_zeroout(inode, &zero_ex);
>>  		if (err)
>>  			goto fix_extent_len;
>> +		/*
>> +		 * The first half contains partially valid data, the splitting
>> +		 * of this extent has not been completed, fix extent length
>> +		 * and ext4_split_extent() split will the first half again.
>> +		 */
>> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
>> +			goto fix_extent_len;
>>  
>>  		/* update the extent length and mark as initialized */
>>  		ex->ee_len = cpu_to_le16(ee_len);
>> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>>  				       EXT4_EXT_MARK_UNWRIT2;
>>  		if (split_flag & EXT4_EXT_DATA_VALID2)
>> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
>> +			split_flag1 |= map->m_lblk > ee_block ?
>> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
>> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>>  		path = ext4_split_extent_at(handle, inode, path,
>>  				map->m_lblk + map->m_len, split_flag1, flags1);
>>  		if (IS_ERR(path))
>> -- 
>> 2.46.1
>>