From: Baokun Li <libaokun1@huawei.com>
If we mount an ext4 filesystem with the data_err=abort option, it should
abort on file data write errors. But if the extent is unwritten, we won't
set the JI_WAIT_DATA bit on the inode, so jbd2 won't wait for the inode's
data to be written back and won't check the inode mapping for errors.
Data writeback failures therefore go unnoticed unless the log is watched
or fsync is called.
Therefore, when data_err=abort is enabled, the journal is aborted when
an I/O error is detected in ext4_end_io_end() to make users who are
concerned about the contents of the file happy.
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
fs/ext4/page-io.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 6054ec27fb48..058bf4660d7b 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -175,6 +175,7 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
{
struct inode *inode = io_end->inode;
handle_t *handle = io_end->handle;
+ struct super_block *sb = inode->i_sb;
int ret = 0;
ext4_debug("ext4_end_io_nolock: io_end 0x%p from inode %lu,list->next 0x%p,"
@@ -190,11 +191,15 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
ret = -EIO;
if (handle)
jbd2_journal_free_reserved(handle);
+ if (test_opt(sb, DATA_ERR_ABORT) &&
+ !ext4_is_quota_file(inode) &&
+ ext4_should_order_data(inode))
+ jbd2_journal_abort(EXT4_SB(sb)->s_journal, ret);
} else {
ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
}
- if (ret < 0 && !ext4_forced_shutdown(inode->i_sb)) {
- ext4_msg(inode->i_sb, KERN_EMERG,
+ if (ret < 0 && !ext4_forced_shutdown(sb)) {
+ ext4_msg(sb, KERN_EMERG,
"failed to convert unwritten extents to written "
"extents -- potential data loss! "
"(inode %lu, error %d)", inode->i_ino, ret);
--
2.46.1
On Fri 20-12-24 14:07:55, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> If we mount an ext4 fs with data_err=abort option, it should abort on
> file data write error. But if the extent is unwritten, we won't add a
> JI_WAIT_DATA bit to the inode, so jbd2 won't wait for the inode's data
> to be written back and check the inode mapping for errors. The data
> writeback failures are not sensed unless the log is watched or fsync
> is called.
>
> Therefore, when data_err=abort is enabled, the journal is aborted when
> an I/O error is detected in ext4_end_io_end() to make users who are
> concerned about the contents of the file happy.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
I'm not opposed to this change but I think we should better define the
expectations around data_err=abort. For example the dependency on
data=ordered is kind of strange and the current semantics of data_err=abort
are hard to understand for admins (since they are mostly implementation
defined). For example if IO error happens on data overwrites, the
filesystem will not be aborted because we don't bother tracking such data
as ordered (for performance reasons). Since you've apparently talked to people
using this option: What are their expectations about the option?
Honza
> ---
> fs/ext4/page-io.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 6054ec27fb48..058bf4660d7b 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -175,6 +175,7 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
> {
> struct inode *inode = io_end->inode;
> handle_t *handle = io_end->handle;
> + struct super_block *sb = inode->i_sb;
> int ret = 0;
>
> ext4_debug("ext4_end_io_nolock: io_end 0x%p from inode %lu,list->next 0x%p,"
> @@ -190,11 +191,15 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
> ret = -EIO;
> if (handle)
> jbd2_journal_free_reserved(handle);
> + if (test_opt(sb, DATA_ERR_ABORT) &&
> + !ext4_is_quota_file(inode) &&
> + ext4_should_order_data(inode))
> + jbd2_journal_abort(EXT4_SB(sb)->s_journal, ret);
> } else {
> ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
> }
> - if (ret < 0 && !ext4_forced_shutdown(inode->i_sb)) {
> - ext4_msg(inode->i_sb, KERN_EMERG,
> + if (ret < 0 && !ext4_forced_shutdown(sb)) {
> + ext4_msg(sb, KERN_EMERG,
> "failed to convert unwritten extents to written "
> "extents -- potential data loss! "
> "(inode %lu, error %d)", inode->i_ino, ret);
> --
> 2.46.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2024/12/20 18:36, Jan Kara wrote:
> On Fri 20-12-24 14:07:55, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> If we mount an ext4 fs with data_err=abort option, it should abort on
>> file data write error. But if the extent is unwritten, we won't add a
>> JI_WAIT_DATA bit to the inode, so jbd2 won't wait for the inode's data
>> to be written back and check the inode mapping for errors. The data
>> writeback failures are not sensed unless the log is watched or fsync
>> is called.
>>
>> Therefore, when data_err=abort is enabled, the journal is aborted when
>> an I/O error is detected in ext4_end_io_end() to make users who are
>> concerned about the contents of the file happy.
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
Hi Honza,

Thank you for your review and feedback!
> I'm not opposed to this change but I think we should better define the
> expectations around data_err=abort.
Totally agree, the definition of this option is a bit vague right now.
Its semantics have changed implicitly across kernel versions.

Originally, the v2.6.28-rc1 commit 5bf5683a33f3 ("ext4: add an option to
control error handling on file data") introduced "data_err=abort". The
implementation of this mount option relies on JBD2_ABORT_ON_SYNCDATA_ERR,
a flag that takes effect when journal_finish_inode_data_buffers() returns
an error. Back then, ext4_write_end() in ordered mode added the inode to
the ordered data list whether the write was an append or an overwrite, so
all write failures in ordered mode would abort the journal. This also
matches the documented semantics: "Abort the journal if an error occurs
in a file data buffer in ordered mode."

Then commit 06bd3c36a733 ("ext4: fix data exposure after a crash") in
v4.7-rc1, in order to avoid exposing stale data, made us add inodes to
the ordered data list only when attaching freshly allocated blocks to the
inode using a written extent. Since then, only failures of writes using
written extents (aka dioread_lock mode) in ordered mode abort the
journal, and "data_err=abort" no longer takes effect for unwritten
extents. There are more historical changes to the relevant logic, so
please correct me if I'm missing something.
> For example the dependency on
> data=ordered is kind of strange and the current semantics of data_err=abort
> are hard to understand for admins (since they are mostly implementation
> defined). For example if IO error happens on data overwrites, the
> filesystem will not be aborted because we don't bother tracking such data
> as ordered (for performance reasons). Since you've apparently talked to people
> using this option: What are their expectations about the option?
>
> Honza
As was the original intent of introducing "data_err=abort", users of this
option are concerned about corruption of critical data spreading
silently; that is, they worry that the data actually read does not match
the data written.

As you said, we don't track overwrites for performance reasons. But
compared to the poor performance of data=journal and the risk of a cache
drop exposing stale data, not being able to sense data errors on
overwrites is acceptable.

After enabling "data_err=abort" in dioread_nolock mode, after drop_caches
or a remount the user will not see unexpected all-zero data in the
unwritten area, but rather the earlier consistent data; the data in the
file is trustworthy, at the cost of losing some trailing data.

On the other hand, adding new written extents and converting unwritten
extents to written both expose the data to the user, so the user is
concerned about whether the data is correct at that point.

In general, I think we can update the semantics of "data_err=abort" to:
"Abort the journal if the file fails to write back data on extended writes
in ORDERED mode". Do you have any thoughts on this?

Thanks,
Baokun
Hello!
On Fri 20-12-24 21:39:39, Baokun Li wrote:
> On 2024/12/20 18:36, Jan Kara wrote:
> > On Fri 20-12-24 14:07:55, libaokun@huaweicloud.com wrote:
> > > From: Baokun Li <libaokun1@huawei.com>
> > >
> > > If we mount an ext4 fs with data_err=abort option, it should abort on
> > > file data write error. But if the extent is unwritten, we won't add a
> > > JI_WAIT_DATA bit to the inode, so jbd2 won't wait for the inode's data
> > > to be written back and check the inode mapping for errors. The data
> > > writeback failures are not sensed unless the log is watched or fsync
> > > is called.
> > >
> > > Therefore, when data_err=abort is enabled, the journal is aborted when
> > > an I/O error is detected in ext4_end_io_end() to make users who are
> > > concerned about the contents of the file happy.
> > >
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
>
> Thank you for your review and feedback!
> > I'm not opposed to this change but I think we should better define the
> > expectations around data_err=abort.
> Totally agree, the definition of this option is a bit vague right now.
> Its semantics have changed implicitly across kernel versions.
>
> Originally in v2.6.28-rc1 commit 5bf5683a33f3 (“ext4: add an option to
> control error handling on file data”) introduced “data_err=abort”, the
> implementation of this mount option relies on JBD2_ABORT_ON_SYNCDATA_ERR,
> and this flag takes effect when the journal_finish_inode_data_buffers()
> function returns an error. At this point in ext4_write_end(), in ordered
> mode, we add the inode to the ordered data list, whether it is an append
> write or an overwrite write. Therefore all write failures in ordered mode
> will abort the journal. This is also the semantics in the documentation
> - “Abort the journal if an error occurs in a file data buffer in ordered
> mode.”.
Well, that is not quite true. Normally, we run in delalloc mode and use
ext4_da_write_end() to finish writes. Thus normally inode was not added to
the transaction's list of inodes to flush (since 3.8 where this behavior
got implemented by commit f3b59291a69d ("ext4: remove calls to
ext4_jbd2_file_inode() from delalloc write path")). Then the commit
06bd3c36a733 (“ext4: fix data exposure after a crash”) in 4.7 realized this
is broken and fixed things to properly flush blocks when needed.
Actually the data=ordered mode always guaranteed we will not expose stale
data but never guaranteed all the written data will be flushed. Thus
data_err=abort always controlled "what should jbd2 do when it spots error
when flushing data" rather than any kind of guarantee that IO error on any
data writeback results in filesystem abort. After all page writeback can
easily try to flush the data before a transaction commit and hit IO error
and jbd2 then won't notice the problem (the page will be clean already) and
it was always like that.
> > For example the dependency on
> > data=ordered is kind of strange and the current semantics of data_err=abort
> > are hard to understand for admins (since they are mostly implementation
> > defined). For example if IO error happens on data overwrites, the
> > filesystem will not be aborted because we don't bother tracking such data
> > as ordered (for performance reasons). Since you've apparently talked to people
> > using this option: What are their expectations about the option?
>
> As was the original intent of introducing "data_err=abort", users who
> use this option are concerned about corruption of critical data spreading
> silently, that is, they are concerned that the data actually read does
> not match the data written.
OK, so you really want any write IO error to result in filesystem abort?
Both page writeback and direct IO writes?
> But as you said, we don't track overwrite writes for performance reasons.
> But compared to the poor performance of journal_data and the risk of the
> drop cache exposing stale, not being able to sense data errors on overwrite
> writes is acceptable.
>
> After enabling ‘data_err=abort’ in dioread_nolock mode, after drop_cache
> or remount, the user will not see the unexpected all-zero data in the
> unwritten area, but rather the earlier consistent data, and the data in
> the file is trustworthy, at the cost of some trailing data.
>
> On the other hand, adding a new written extents and converting an
> unwritten extents to written both expose the data to the user, so the user
> is concerned about whether the data is correct at that point.
>
> In general, I think we can update the semantics of “data_err=abort” to,
> “Abort the journal if the file fails to write back data on extended writes
> in ORDERED mode”. Do you have any thoughts on this?
I agree it makes sense to make the semantics of data_err=abort more
obvious. Based on the usecase you've described - i.e., rather take the
filesystem down on write IO error than risk returning old data later - it
would make sense to me to also do this on direct IO writes. Also I would do
this regardless of data=writeback/ordered/journalled mode because although
users wanting data_err=abort behavior will also likely want the guarantees
of data=ordered mode, these are two different things and I can imagine use
cases for setups with data=writeback and data_err=abort as well (e.g. for
scratch filesystems which get recreated on each system startup).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2025/1/6 22:32, Jan Kara wrote:
> Hello!
>
> On Fri 20-12-24 21:39:39, Baokun Li wrote:
>> On 2024/12/20 18:36, Jan Kara wrote:
>>> On Fri 20-12-24 14:07:55, libaokun@huaweicloud.com wrote:
>>>> From: Baokun Li <libaokun1@huawei.com>
>>>>
>>>> If we mount an ext4 fs with data_err=abort option, it should abort on
>>>> file data write error. But if the extent is unwritten, we won't add a
>>>> JI_WAIT_DATA bit to the inode, so jbd2 won't wait for the inode's data
>>>> to be written back and check the inode mapping for errors. The data
>>>> writeback failures are not sensed unless the log is watched or fsync
>>>> is called.
>>>>
>>>> Therefore, when data_err=abort is enabled, the journal is aborted when
>>>> an I/O error is detected in ext4_end_io_end() to make users who are
>>>> concerned about the contents of the file happy.
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> Thank you for your review and feedback!
>>> I'm not opposed to this change but I think we should better define the
>>> expectations around data_err=abort.
>> Totally agree, the definition of this option is a bit vague right now.
> > Its semantics have changed implicitly across kernel versions.
>>
>> Originally in v2.6.28-rc1 commit 5bf5683a33f3 (“ext4: add an option to
>> control error handling on file data”) introduced “data_err=abort”, the
> > implementation of this mount option relies on JBD2_ABORT_ON_SYNCDATA_ERR,
>> and this flag takes effect when the journal_finish_inode_data_buffers()
>> function returns an error. At this point in ext4_write_end(), in ordered
>> mode, we add the inode to the ordered data list, whether it is an append
>> write or an overwrite write. Therefore all write failures in ordered mode
>> will abort the journal. This is also the semantics in the documentation
>> - “Abort the journal if an error occurs in a file data buffer in ordered
>> mode.”.
> Well, that is not quite true. Normally, we run in delalloc mode and use
> ext4_da_write_end() to finish writes. Thus normally inode was not added to
> the transaction's list of inodes to flush (since 3.8 where this behavior
> got implemented by commit f3b59291a69d ("ext4: remove calls to
> ext4_jbd2_file_inode() from delalloc write path")). Then the commit
> 06bd3c36a733 (“ext4: fix data exposure after a crash”) in 4.7 realized this
> is broken and fixed things to properly flush blocks when needed.
Yes, we inadvertently changed the behavior of "data_err=abort" when
fixing the bug. The implicit dependency between "data_err=abort" and
ext4_jbd2_file_inode() makes it hard to spot this.
> Actually the data=ordered mode always guaranteed we will not expose stale
> data but never guaranteed all the written data will be flushed.
Yes, compared to the data=writeback mode, the semantics of data=ordered can
guarantee that stale data will not be exposed.
> Thus
> data_err=abort always controlled "what should jbd2 do when it spots error
> when flushing data" rather than any kind of guarantee that IO error on any
> data writeback results in filesystem abort.
I think this is the initial design problem of data_err=abort. The
description in its commit is to abort the journal when file data
corruption is detected, because not all applications frequently
check for errors in files through fsync.
As for why it is only in data=ordered mode, I personally guess that
in data=journal mode, file data is added to the journal, and the
journal itself will be aborted when data is abnormal; and people who
care about file data often do not use data=writeback mode, which may
expose stale data.
> After all page writeback can
> easily try to flush the data before a transaction commit and hit IO error
> and jbd2 then won't notice the problem (the page will be clean already) and
> it was always like that.
Good point! "data_err=abort" did have this problem before. If the
relevant metadata cache has been cleaned, we will not be able to
perceive any errors, and fsync will not work either. This also
shows that checking at IO completion is a better choice.
>>> For example the dependency on
>>> data=ordered is kind of strange and the current semantics of data_err=abort
>>> are hard to understand for admins (since they are mostly implementation
>>> defined). For example if IO error happens on data overwrites, the
>>> filesystem will not be aborted because we don't bother tracking such data
>>> as ordered (for performance reasons). Since you've apparently talked to people
>>> using this option: What are their expectations about the option?
>> As was the original intent of introducing "data_err=abort", users who
>> use this option are concerned about corruption of critical data spreading
>> silently, that is, they are concerned that the data actually read does
>> not match the data written.
> OK, so you really want any write IO error to result in filesystem abort?
> Both page writeback and direct IO writes?
Direct I/O writes are okay because the inode size is updated only after
all the work is completed, so users know immediately whether the write has
actually landed on disk and can take corresponding action.

Buffered I/O writes return success once the data has been copied into
memory; users cannot tell whether the writeback succeeded unless they call
fsync, so they may never notice that the data on disk is bad.

Therefore, IMO, data_err=abort only needs to care about page writeback.
>> But as you said, we don't track overwrite writes for performance reasons.
>> But compared to the poor performance of journal_data and the risk of the
>> drop cache exposing stale, not being able to sense data errors on overwrite
>> writes is acceptable.
>>
>> After enabling ‘data_err=abort’ in dioread_nolock mode, after drop_cache
>> or remount, the user will not see the unexpected all-zero data in the
>> unwritten area, but rather the earlier consistent data, and the data in
>> the file is trustworthy, at the cost of some trailing data.
>>
>> On the other hand, adding a new written extents and converting an
>> unwritten extents to written both expose the data to the user, so the user
>> is concerned about whether the data is correct at that point.
>>
>> In general, I think we can update the semantics of “data_err=abort” to,
>> “Abort the journal if the file fails to write back data on extended writes
>> in ORDERED mode”. Do you have any thoughts on this?
> I agree it makes sense to make the semantics of data_err=abort more
> obvious. Based on the usecase you've described - i.e., rather take the
> filesystem down on write IO error than risk returning old data later - it
> would make sense to me to also do this on direct IO writes.
Okay, I will update the semantics of data_err=abort in the next version.
For direct I/O writes, I think we don't need it because users can
perceive errors in time.
> Also I would do
> this regardless of data=writeback/ordered/journalled mode because although
> users wanting data_err=abort behavior will also likely want the guarantees
> of data=ordered mode, these are two different things
For data=journal mode, the journal itself will abort when data is abnormal.
However, as you pointed out, the above bug may cause errors to be missed.
Therefore, we can perform this check by default for journaled files.
> and I can imagine use
> cases for setups with data=writeback and data_err=abort as well (e.g. for
> scratch filesystems which get recreated on each system startup).
>
> Honza
Users using data=writeback often do not care about data consistency.
I did not understand your example. Could you please explain it in detail?
Thank you in advance.
Cheers,
Baokun
On Wed 08-01-25 11:43:08, Baokun Li wrote:
> On 2025/1/6 22:32, Jan Kara wrote:
> > > But as you said, we don't track overwrite writes for performance reasons.
> > > But compared to the poor performance of journal_data and the risk of the
> > > drop cache exposing stale, not being able to sense data errors on overwrite
> > > writes is acceptable.
> > >
> > > After enabling "data_err=abort" in dioread_nolock mode, after drop_cache
> > > or remount, the user will not see the unexpected all-zero data in the
> > > unwritten area, but rather the earlier consistent data, and the data in
> > > the file is trustworthy, at the cost of some trailing data.
> > >
> > > On the other hand, adding a new written extents and converting an
> > > unwritten extents to written both expose the data to the user, so the user
> > > is concerned about whether the data is correct at that point.
> > >
> > > In general, I think we can update the semantics of "data_err=abort" to,
> > > "Abort the journal if the file fails to write back data on extended writes
> > > in ORDERED mode". Do you have any thoughts on this?
> > I agree it makes sense to make the semantics of data_err=abort more
> > obvious. Based on the usecase you've described - i.e., rather take the
> > filesystem down on write IO error than risk returning old data later - it
> > would make sense to me to also do this on direct IO writes.
> Okay, I will update the semantics of data_err=abort in the next version.
> For direct I/O writes, I think we don't need it because users can
> perceive errors in time.

So I agree that direct IO users will generally notice the IO error, so the
chances of bugs due to missing the IO error are low. But I think the
question is really the other way around: is there a good reason to make
direct IO writes different? Because if I as a sysadmin want to secure a
system from IO error handling bugs, then having to think about whether some
application uses direct IO or not is another nuisance. Why should I be
bothered?

> > Also I would do
> > this regardless of data=writeback/ordered/journalled mode because although
> > users wanting data_err=abort behavior will also likely want the guarantees
> > of data=ordered mode, these are two different things
> For data=journal mode, the journal itself will abort when data is abnormal.
> However, as you pointed out, the above bug may cause errors to be missed.
> Therefore, we can perform this check by default for journaled files.
> > and I can imagine use
> > cases for setups with data=writeback and data_err=abort as well (e.g. for
> > scratch filesystems which get recreated on each system startup).
> Users using data=writeback often do not care about data consistency.
> I did not understand your example. Could you please explain it in detail?

Well, they don't care about data consistency after a crash. But they
usually do care about data consistency while the system is running. And
unhandled IO errors can lead to data consistency problems without crashing
the system (for example, if writeback fails and the page gets evicted from
memory later, you have lost the new data and may see an old version of it).
And I see data_err=abort as a way to say: "I don't trust my applications to
handle IO errors well. Rather take the filesystem down in that case than
risk data consistency issues".

								Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
Hello!

On 2025/1/8 21:43, Jan Kara wrote:
> On Wed 08-01-25 11:43:08, Baokun Li wrote:
>> On 2025/1/6 22:32, Jan Kara wrote:
>>>> But as you said, we don't track overwrite writes for performance reasons.
>>>> But compared to the poor performance of journal_data and the risk of the
>>>> drop cache exposing stale, not being able to sense data errors on overwrite
>>>> writes is acceptable.
>>>>
>>>> After enabling "data_err=abort" in dioread_nolock mode, after drop_cache
>>>> or remount, the user will not see the unexpected all-zero data in the
>>>> unwritten area, but rather the earlier consistent data, and the data in
>>>> the file is trustworthy, at the cost of some trailing data.
>>>>
>>>> On the other hand, adding a new written extents and converting an
>>>> unwritten extents to written both expose the data to the user, so the user
>>>> is concerned about whether the data is correct at that point.
>>>>
>>>> In general, I think we can update the semantics of "data_err=abort" to,
>>>> "Abort the journal if the file fails to write back data on extended writes
>>>> in ORDERED mode". Do you have any thoughts on this?
>>> I agree it makes sense to make the semantics of data_err=abort more
>>> obvious. Based on the usecase you've described - i.e., rather take the
>>> filesystem down on write IO error than risk returning old data later - it
>>> would make sense to me to also do this on direct IO writes.
>> Okay, I will update the semantics of data_err=abort in the next version.
>> For direct I/O writes, I think we don't need it because users can
>> perceive errors in time.
> So I agree that direct IO users will generally notice the IO error so the
> chances for bugs due to missing the IO error is low. But I think the
> question is really the other way around: Is there a good reason to make
> direct IO writes different? Because if I as a sysadmin want to secure a
> system from IO error handling bugs, then having to think whether some
> application uses direct IO or not is another nuisance. Why should I be
> bothered?
This is not quite right. Regardless of whether it is a buffered write or a
DIO write, users will check the return value of the write operation,
because errors can occur not only when data is written to disk.

It's just that when a DIO write returns successfully, users can be sure
that the data has been written to the disk. However, when a buffered write
returns successfully, it only means that the data has been copied into the
page cache; whether it has been successfully written back to the disk is
unknown to the user.

That's why we need data_err=abort: to ensure that users are aware when
page writeback fails and to prevent data corruption from spreading.
>>> Also I would do
>>> this regardless of data=writeback/ordered/journalled mode because although
>>> users wanting data_err=abort behavior will also likely want the guarantees
>>> of data=ordered mode, these are two different things
>> For data=journal mode, the journal itself will abort when data is abnormal.
>> However, as you pointed out, the above bug may cause errors to be missed.
>> Therefore, we can perform this check by default for journaled files.
>>> and I can imagine use
>>> cases for setups with data=writeback and data_err=abort as well (e.g. for
>>> scratch filesystems which get recreated on each system startup).
>> Users using data=writeback often do not care about data consistency.
>> I did not understand your example. Could you please explain it in detail?
> Well, they don't care about data consistency after a crash. But they
> usually do care about data consistency while the system is running. And
> unhandled IO errors can lead to data consistency problems without crashing
> the system (for example if writeback fails and page gets evicted from
> memory later, you have lost the new data and may see old version of it).
I see your point. I concur that it is indeed meaningful for data_err=abort
to be supported in data=writeback mode. Thank you for your explanation!
> And I see data_err=abort as a way to say: "I don't trust my applications to
> handle IO errors well. Rather take the filesystem down in that case than
> risk data consistency issues".
>
> Honza
I still prefer to think of this as a supplement for users who cannot
perceive page writeback failures in a timely manner. The fsync operation
is complex, requires frequent waiting, and may have omissions.

In addition, because ext4_end_bio() runs in interrupt context, we can't
abort the journal directly there due to potential locking issues.
Instead, we now add writeback error checks and journal abort logic to
ext4_end_io_end(), which is called by a kworker during unwritten extent
conversion. Consequently, for modes that don't support unwritten extents
(e.g., nodelalloc, data=journal; see ext4_should_dioread_nolock()), only
the check in journal_submit_data_buffers() will be effective. Should we
call the kworker for all files in ext4_end_bio()?

Thanks again!

Regards,
Baokun
On Wed 08-01-25 22:44:42, Baokun Li wrote:
> On 2025/1/8 21:43, Jan Kara wrote:
> > On Wed 08-01-25 11:43:08, Baokun Li wrote:
> > > On 2025/1/6 22:32, Jan Kara wrote:
> > > > > But as you said, we don't track overwrite writes for performance reasons.
> > > > > But compared to the poor performance of journal_data and the risk of the
> > > > > drop cache exposing stale, not being able to sense data errors on overwrite
> > > > > writes is acceptable.
> > > > >
> > > > > After enabling ‘data_err=abort’ in dioread_nolock mode, after drop_cache
> > > > > or remount, the user will not see the unexpected all-zero data in the
> > > > > unwritten area, but rather the earlier consistent data, and the data in
> > > > > the file is trustworthy, at the cost of some trailing data.
> > > > >
> > > > > On the other hand, adding a new written extents and converting an
> > > > > unwritten extents to written both expose the data to the user, so the user
> > > > > is concerned about whether the data is correct at that point.
> > > > >
> > > > > In general, I think we can update the semantics of “data_err=abort” to,
> > > > > “Abort the journal if the file fails to write back data on extended writes
> > > > > in ORDERED mode”. Do you have any thoughts on this?
> > > > I agree it makes sense to make the semantics of data_err=abort more
> > > > obvious. Based on the usecase you've described - i.e., rather take the
> > > > filesystem down on write IO error than risk returning old data later - it
> > > > would make sense to me to also do this on direct IO writes.
> > > Okay, I will update the semantics of data_err=abort in the next version.
> > > For direct I/O writes, I think we don't need it because users can
> > > perceive errors in time.
> > So I agree that direct IO users will generally notice the IO error so the
> > chances of bugs due to missing the IO error are low. But I think the
> > question is really the other way around: Is there a good reason to make
> > direct IO writes different? Because if I as a sysadmin want to secure a
> > system from IO error handling bugs, then having to think whether some
> > application uses direct IO or not is another nuisance. Why should I be
> > bothered?
> This is not quite right. Regardless of whether it is a buffered write or a
> DIO write, users will check the return value of the write operation, because
> errors can occur even before the data is written to disk.
Yes, they *should* check the return value of write(2) and take appropriate
action. But do all of them check and mainly do they do something meaningful
with the error? That's what I'm not so sure about :).
> It's just that when a DIO write returns successfully, users can be sure
> that the data has been written to the disk.
>
> However, when a buffered write returns successfully, it only means that the
> data has been copied into the page cache. Whether it has been successfully
> written back to the disk is unknown to the user.
>
> That's why we need data_err=abort to ensure that users are aware when the
> page writeback fails and to prevent data corruption from spreading.
I understand including DIO need not be interesting for your usecase but I
still think it may be more consistent overall decision. But perhaps I'll
ask Ted what he thinks about it.
> > > > Also I would do
> > > > this regardless of data=writeback/ordered/journalled mode because although
> > > > users wanting data_err=abort behavior will also likely want the guarantees
> > > > of data=ordered mode, these are two different things
> > > In data=journal mode, the journal itself already aborts on data write
> > > errors. However, as you pointed out, the above bug may cause errors to be
> > > missed. Therefore, we can perform this check by default for journaled files.
> > > > and I can imagine use
> > > > cases for setups with data=writeback and data_err=abort as well (e.g. for
> > > > scratch filesystems which get recreated on each system startup).
> > > Users using data=writeback often do not care about data consistency.
> > > I did not understand your example. Could you please explain it in detail?
> > Well, they don't care about data consistency after a crash. But they
> > usually do care about data consistency while the system is running. And
> > unhandled IO errors can lead to data consistency problems without crashing
> > the system (for example if writeback fails and page gets evicted from
> > memory later, you have lost the new data and may see old version of it).
> I see your point. I concur that it is indeed meaningful for
> data_err=abort to be supported in data=writeback mode.
>
> Thank you for your explanation!
> > And I see data_err=abort as a way to say: "I don't trust my applications to
> > handle IO errors well. Rather take the filesystem down in that case than
> > risk data consistency issues".
> >
> > Honza
>
> I still prefer to think of this as a supplement for users who are unable
> to perceive page writeback failures in a timely manner. Checking errors via
> fsync is complex, requires frequent waiting, and may still miss errors.
I agree properly checking for errors from buffered writes is much more
painful.
> In addition, because ext4_end_bio() runs in interrupt context, we can't
> abort the journal directly there due to potential locking issues.
>
> Instead, we now add write-back error checks and journal abort logic
> to ext4_end_io_end(), which is called by a kworker during unwritten
> extent conversion.
>
> Consequently, for modes that don't support unwritten extents (e.g.,
> nodelalloc, journal_data, see ext4_should_dioread_nolock()), only the
> check in journal_submit_data_buffers() will be effective. Should we
> call the kworker for all files in ext4_end_bio()?
So how I imagined this would work is that if we get error in ext4_end_bio()
and data_err=abort is set, we will queue work (probably stored in the
superblock) to abort the filesystem. Alternatively, a bit more generic
approach might be to store the error state in the io_end and implement
something like:
static bool ext4_io_end_need_deferred_completion(ext4_io_end_t *io_end)
{
	return io_end->flag & (EXT4_IO_END_UNWRITTEN | EXT4_IO_END_ERROR);
}
and use it in ext4_end_bio() and ext4_put_io_end_defer() to determine
whether the io_end needs processing in the workqueue or not. And
ext4_put_io_end() can then abort the filesystem if EXT4_IO_END_ERROR is
set.
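The suggested predicate can be exercised in a self-contained userspace sketch. The flag values and the struct layout below are illustrative stand-ins, not the real definitions from fs/ext4/ext4.h (in particular, EXT4_IO_END_ERROR does not exist upstream yet and is the new bit being proposed here):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel flag bits and io_end type. */
#define EXT4_IO_END_UNWRITTEN 0x0001
#define EXT4_IO_END_ERROR     0x0002

typedef struct {
	unsigned int flag;
} ext4_io_end_t;

/* Defer completion to the workqueue when we either need to convert
 * unwritten extents or need to abort the journal after an I/O error. */
static bool ext4_io_end_need_deferred_completion(ext4_io_end_t *io_end)
{
	return io_end->flag & (EXT4_IO_END_UNWRITTEN | EXT4_IO_END_ERROR);
}
```

With this shape, a plain successful write with neither bit set completes inline, while either condition routes the io_end through the workqueue.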
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
On 2025/1/8 23:28, Jan Kara wrote:
> On Wed 08-01-25 22:44:42, Baokun Li wrote:
>> On 2025/1/8 21:43, Jan Kara wrote:
>>> On Wed 08-01-25 11:43:08, Baokun Li wrote:
>>>> On 2025/1/6 22:32, Jan Kara wrote:
>>>>>> But as you said, we don't track overwrite writes for performance reasons.
>>>>>> But compared to the poor performance of journal_data and the risk of the
>>>>>> drop cache exposing stale data, not being able to sense data errors on overwrite
>>>>>> writes is acceptable.
>>>>>>
>>>>>> After enabling 'data_err=abort' in dioread_nolock mode, after drop_caches
>>>>>> or a remount the user will not see unexpected all-zero data in the
>>>>>> unwritten area, but rather the earlier consistent data, so the data in
>>>>>> the file is trustworthy, at the cost of losing some trailing data.
>>>>>>
>>>>>> On the other hand, adding new written extents and converting
>>>>>> unwritten extents to written both expose the data to the user, so the user
>>>>>> is concerned about whether the data is correct at that point.
>>>>>>
>>>>>> In general, I think we can update the semantics of "data_err=abort" to:
>>>>>> "Abort the journal if the file fails to write back data on extending writes
>>>>>> in ORDERED mode". Do you have any thoughts on this?
>>>>> I agree it makes sense to make the semantics of data_err=abort more
>>>>> obvious. Based on the usecase you've described - i.e., rather take the
>>>>> filesystem down on write IO error than risk returning old data later - it
>>>>> would make sense to me to also do this on direct IO writes.
>>>> Okay, I will update the semantics of data_err=abort in the next version.
>>>> For direct I/O writes, I think we don't need it because users can
>>>> perceive errors in time.
>>> So I agree that direct IO users will generally notice the IO error so the
>>> chances for bugs due to missing the IO error is low. But I think the
>>> question is really the other way around: Is there a good reason to make
>>> direct IO writes different? Because if I as a sysadmin want to secure a
>>> system from IO error handling bugs, then having to think whether some
>>> application uses direct IO or not is another nuisance. Why should I be
>>> bothered?
>> This is not quite right. Regardless of whether it is a buffered write or a
>> DIO write, users will check the return value of the write operation, because
>> errors can occur even before the data is written to disk.
> Yes, they *should* check the return value of write(2) and take appropriate
> action. But do all of them check and mainly do they do something meaningful
> with the error? That's what I'm not so sure about :).
Indeed, we cannot confirm that all users check the return value. However,
we also cannot promise that data will not be lost when users skip those
checks. Giving such a guarantee would require intercepting every error
returned to users by write operations and aborting the journal. But write
operations may also fail before the data is written to disk (e.g., -ENOMEM,
-EPERM, etc.), so checks would need to be added in ext4_file_write_iter()
or even in the VFS... -- We cannot provide such a guarantee.
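To make this concrete, here is a minimal userspace sketch of the checks a careful application needs around a buffered write. The helper name write_durably is hypothetical, and forcing a real writeback error is not portable, so this only shows at which call each class of error is reported:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: a buffered write is only known to be durable once
 * fsync() succeeds. Errors can surface at write() (e.g. ENOSPC before
 * any data reaches disk), at fsync() (deferred writeback errors), or
 * even at close(). */
static int write_durably(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)
		goto fail;	/* may fail before data reaches the disk */
	if (fsync(fd) < 0)
		goto fail;	/* writeback errors are reported here */
	return close(fd);	/* close() can report errors as well */
fail:
	close(fd);
	return -1;
}
```

An application that checks only write(2) misses the fsync() step, which is exactly the window data_err=abort is meant to cover.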
>
>> It's just that when a DIO write returns successfully, users can be sure
>> that the data has been written to the disk.
>>
>> However, when a buffered write returns successfully, it only means that the
>> data has been copied into the page cache. Whether it has been successfully
>> written back to the disk is unknown to the user.
>>
>> That's why we need data_err=abort to ensure that users are aware when the
>> page writeback fails and to prevent data corruption from spreading.
> I understand including DIO need not be interesting for your usecase but I
> still think it may be more consistent overall decision. But perhaps I'll
> ask Ted what he thinks about it.
Okay, thanks for asking Ted for his opinion on this.
>
>>>>> Also I would do
>>>>> this regardless of data=writeback/ordered/journalled mode because although
>>>>> users wanting data_err=abort behavior will also likely want the guarantees
>>>>> of data=ordered mode, these are two different things
>>>> In data=journal mode, the journal itself already aborts on data write
>>>> errors. However, as you pointed out, the above bug may cause errors to be
>>>> missed. Therefore, we can perform this check by default for journaled files.
>>>>> and I can imagine use
>>>>> cases for setups with data=writeback and data_err=abort as well (e.g. for
>>>>> scratch filesystems which get recreated on each system startup).
>>>> Users using data=writeback often do not care about data consistency.
>>>> I did not understand your example. Could you please explain it in detail?
>>> Well, they don't care about data consistency after a crash. But they
>>> usually do care about data consistency while the system is running. And
>>> unhandled IO errors can lead to data consistency problems without crashing
>>> the system (for example if writeback fails and page gets evicted from
>>> memory later, you have lost the new data and may see old version of it).
>> I see your point. I concur that it is indeed meaningful for
>> data_err=abort to be supported in data=writeback mode.
>>
>> Thank you for your explanation!
>>> And I see data_err=abort as a way to say: "I don't trust my applications to
>>> handle IO errors well. Rather take the filesystem down in that case than
>>> risk data consistency issues".
>>>
>>> Honza
>> I still prefer to think of this as a supplement for users who are unable
>> to perceive page writeback failures in a timely manner. Checking errors via
>> fsync is complex, requires frequent waiting, and may still miss errors.
> I agree properly checking for errors from buffered writes is much more
> painful.
Yeah, indeed.
>> In addition, because ext4_end_bio() runs in interrupt context, we can't
>> abort the journal directly there due to potential locking issues.
>>
>> Instead, we now add write-back error checks and journal abort logic
>> to ext4_end_io_end(), which is called by a kworker during unwritten
>> extent conversion.
>>
>> Consequently, for modes that don't support unwritten extents (e.g.,
>> nodelalloc, journal_data, see ext4_should_dioread_nolock()), only the
>> check in journal_submit_data_buffers() will be effective. Should we
>> call the kworker for all files in ext4_end_bio()?
> So how I imagined this would work is that if we get error in ext4_end_bio()
> and data_err=abort is set, we will queue work (probably stored in the
> superblock) to abort the filesystem. Alternatively, a bit more generic
> approach might be to store the error state in the io_end and implement
> something like:
>
> static bool ext4_io_end_need_deferred_completion(ext4_io_end_t *io_end)
> {
> 	return io_end->flag & (EXT4_IO_END_UNWRITTEN | EXT4_IO_END_ERROR);
> }
>
> and use it in ext4_end_bio() and ext4_put_io_end_defer() to determine
> whether the io_end needs processing in the workqueue or not. And
> ext4_put_io_end() can then abort the filesystem if EXT4_IO_END_ERROR is
> set.
>
> Honza
This idea looks great.
Thanks so much for the suggestion!
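A rough userspace simulation of the proposed flow: the bio completion path only records the error, and a later workqueue-context step performs the abort. All names and flag values below are illustrative stand-ins, not the real kernel ones, and the "abort" is a plain flag rather than a call to jbd2_journal_abort():

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the kernel flag bits and io_end type. */
#define EXT4_IO_END_UNWRITTEN 0x0001
#define EXT4_IO_END_ERROR     0x0002

typedef struct {
	unsigned int flag;
} ext4_io_end_t;

static bool journal_aborted;

/* Simulated ext4_end_bio(): may run in interrupt context, so it only
 * records the error in the io_end and defers the heavy lifting. */
static void end_bio(ext4_io_end_t *io_end, bool io_error)
{
	if (io_error)
		io_end->flag |= EXT4_IO_END_ERROR;
	/* real code would queue the io_end to a workqueue here */
}

/* Simulated workqueue handler: a safe process context in which a
 * stand-in for jbd2_journal_abort() can run. */
static void end_io_work(ext4_io_end_t *io_end)
{
	if (io_end->flag & EXT4_IO_END_ERROR)
		journal_aborted = true;
}
```

This keeps the interrupt-context path free of journal locking while still guaranteeing that a recorded I/O error eventually takes the filesystem down.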
Regards,
Baokun