ext4_fc_track_inode() can return without sleeping when
EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
ei->i_data_sem was done unconditionally before the wait loop, which can
WARN in call paths that hold i_data_sem even though we never block. Move
lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
right before schedule().
Signed-off-by: Li Chen <me@linux.beauty>
---
fs/ext4/fast_commit.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index d0926967d086..b0c458082997 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -566,13 +566,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
return;
- /*
- * If we come here, we may sleep while waiting for the inode to
- * commit. We shouldn't be holding i_data_sem when we go to sleep since
- * the commit path needs to grab the lock while committing the inode.
- */
- lockdep_assert_not_held(&ei->i_data_sem);
-
while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
#if (BITS_PER_LONG < 64)
DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
@@ -586,8 +579,16 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
EXT4_STATE_FC_COMMITTING);
#endif
prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
- if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+ if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+ /*
+ * We might sleep while waiting for the inode to commit.
+ * We shouldn't be holding i_data_sem when we go to sleep
+ * since the commit path may grab it while committing this
+ * inode.
+ */
+ lockdep_assert_not_held(&ei->i_data_sem);
schedule();
+ }
finish_wait(wq, &wait.wq_entry);
}
--
2.52.0
Hi Li,
On 12/24/2025 11:29 AM, Li Chen wrote:
> ext4_fc_track_inode() can return without sleeping when
> EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
> ei->i_data_sem was done unconditionally before the wait loop, which can
> WARN in call paths that hold i_data_sem even though we never block. Move
> lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
> right before schedule().
>
> Signed-off-by: Li Chen <me@linux.beauty>
Thank you for the fix patch! However, the solution does not seem to fix
the issue. IIUC, the root cause of this issue is the following race
condition (show only one case), and it may cause a real ABBA dead lock
issue.
ext4_map_blocks()
hold i_data_sem // <- A
ext4_mb_new_blocks()
ext4_dirty_inode()
ext4_fc_commit()
ext4_fc_perform_commit()
set EXT4_STATE_FC_COMMITTING <-B
ext4_fc_write_inode_data()
ext4_map_blocks()
hold i_data_sem // <- A
ext4_fc_track_inode()
wait EXT4_STATE_FC_COMMITTING <- B
jbd2_fc_end_commit()
ext4_fc_cleanup()
clear EXT4_STATE_FC_COMMITTING()
Postponing the lockdep assertion to the point where sleeping is actually
necessary does not resolve this deadlock issue, it merely masks the
problem, right?
I currently don't quite understand why only ext4_fc_track_inode() needs
to wait for the inode being fast committed to be completed, instead of
adding it to the FC_Q_STAGING list like other tracking operations. So
now I don't have a good idea to fix this problem either. Perhaps we
need to rethink the necessity of this waiting, or find a way to avoid
acquiring i_data_sem during fast commit.
Thanks,
Yi.
> ---
> fs/ext4/fast_commit.c | 17 +++++++++--------
> 1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index d0926967d086..b0c458082997 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -566,13 +566,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
> if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
> return;
>
> - /*
> - * If we come here, we may sleep while waiting for the inode to
> - * commit. We shouldn't be holding i_data_sem when we go to sleep since
> - * the commit path needs to grab the lock while committing the inode.
> - */
> - lockdep_assert_not_held(&ei->i_data_sem);
> -
> while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> #if (BITS_PER_LONG < 64)
> DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> @@ -586,8 +579,16 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
> EXT4_STATE_FC_COMMITTING);
> #endif
> prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> - if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
> + if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> + /*
> + * We might sleep while waiting for the inode to commit.
> + * We shouldn't be holding i_data_sem when we go to sleep
> + * since the commit path may grab it while committing this
> + * inode.
> + */
> + lockdep_assert_not_held(&ei->i_data_sem);
> schedule();
> + }
> finish_wait(wq, &wait.wq_entry);
> }
>
Hi Zhang,
Thanks a lot for your comments!
---- On Mon, 05 Jan 2026 20:18:42 +0800 Zhang Yi <yi.zhang@huaweicloud.com> wrote ---
> Hi Li,
>
> On 12/24/2025 11:29 AM, Li Chen wrote:
> > ext4_fc_track_inode() can return without sleeping when
> > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
> > ei->i_data_sem was done unconditionally before the wait loop, which can
> > WARN in call paths that hold i_data_sem even though we never block. Move
> > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
> > right before schedule().
> >
> > Signed-off-by: Li Chen <me@linux.beauty>
>
> Thank you for the fix patch! However, the solution does not seem to fix
> the issue. IIUC, the root cause of this issue is the following race
> condition (show only one case), and it may cause a real ABBA dead lock
> issue.
>
> ext4_map_blocks()
> hold i_data_sem // <- A
> ext4_mb_new_blocks()
> ext4_dirty_inode()
> ext4_fc_commit()
> ext4_fc_perform_commit()
> set EXT4_STATE_FC_COMMITTING <-B
> ext4_fc_write_inode_data()
> ext4_map_blocks()
> hold i_data_sem // <- A
> ext4_fc_track_inode()
> wait EXT4_STATE_FC_COMMITTING <- B
> jbd2_fc_end_commit()
> ext4_fc_cleanup()
> clear EXT4_STATE_FC_COMMITTING()
>
> Postponing the lockdep assertion to the point where sleeping is actually
> necessary does not resolve this deadlock issue, it merely masks the
> problem, right?
I agree. Moving lockdep_assert_not_held(&ei->i_data_sem) closer to the
schedule() site can reduce spurious warnings (since the wait-bit pattern
rechecks the bit after prepare_to_wait()), but it does not remove the
underlying deadlock risk if we ever end up sleeping there, althoughI still
haven't been able to reproduce this ABBA issue.
> I currently don't quite understand why only ext4_fc_track_inode() needs
> to wait for the inode being fast committed to be completed, instead of
> adding it to the FC_Q_STAGING list like other tracking operations. So
> now I don't have a good idea to fix this problem either. Perhaps we
> need to rethink the necessity of this waiting, or find a way to avoid
> acquiring i_data_sem during fast commit.
>
> Thanks,
> Yi.
>
> > ---
> > fs/ext4/fast_commit.c | 17 +++++++++--------
> > 1 file changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > index d0926967d086..b0c458082997 100644
> > --- a/fs/ext4/fast_commit.c
> > +++ b/fs/ext4/fast_commit.c
> > @@ -566,13 +566,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
> > if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
> > return;
> >
> > - /*
> > - * If we come here, we may sleep while waiting for the inode to
> > - * commit. We shouldn't be holding i_data_sem when we go to sleep since
> > - * the commit path needs to grab the lock while committing the inode.
> > - */
> > - lockdep_assert_not_held(&ei->i_data_sem);
> > -
> > while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> > #if (BITS_PER_LONG < 64)
> > DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> > @@ -586,8 +579,16 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
> > EXT4_STATE_FC_COMMITTING);
> > #endif
> > prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> > - if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
> > + if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> > + /*
> > + * We might sleep while waiting for the inode to commit.
> > + * We shouldn't be holding i_data_sem when we go to sleep
> > + * since the commit path may grab it while committing this
> > + * inode.
> > + */
> > + lockdep_assert_not_held(&ei->i_data_sem);
> > schedule();
> > + }
> > finish_wait(wq, &wait.wq_entry);
> > }
> >
>
>
Regards,
Li
Hi Zhang Yi, ---- On Mon, 05 Jan 2026 20:18:42 +0800 Zhang Yi <yi.zhang@huaweicloud.com> wrote --- > Hi Li, > > On 12/24/2025 11:29 AM, Li Chen wrote: > > ext4_fc_track_inode() can return without sleeping when > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for > > ei->i_data_sem was done unconditionally before the wait loop, which can > > WARN in call paths that hold i_data_sem even though we never block. Move > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path, > > right before schedule(). > > > > Signed-off-by: Li Chen <me@linux.beauty> > > Thank you for the fix patch! However, the solution does not seem to fix > the issue. IIUC, the root cause of this issue is the following race > condition (show only one case), and it may cause a real ABBA dead lock > issue. > > ext4_map_blocks() > hold i_data_sem // <- A > ext4_mb_new_blocks() > ext4_dirty_inode() > ext4_fc_commit() > ext4_fc_perform_commit() > set EXT4_STATE_FC_COMMITTING <-B > ext4_fc_write_inode_data() > ext4_map_blocks() > hold i_data_sem // <- A > ext4_fc_track_inode() > wait EXT4_STATE_FC_COMMITTING <- B > jbd2_fc_end_commit() > ext4_fc_cleanup() > clear EXT4_STATE_FC_COMMITTING() > > Postponing the lockdep assertion to the point where sleeping is actually > necessary does not resolve this deadlock issue, it merely masks the > problem, right? > > I currently don't quite understand why only ext4_fc_track_inode() needs > to wait for the inode being fast committed to be completed, instead of > adding it to the FC_Q_STAGING list like other tracking operations. So > now I don't have a good idea to fix this problem either. Perhaps we > need to rethink the necessity of this waiting, or find a way to avoid > acquiring i_data_sem during fast commit. Thanks a lot for your kind review! I'll provide feedback tomorrow. Regards, Li
On 1/6/2026 8:18 PM, Li Chen wrote: > Hi Zhang Yi, > > ---- On Mon, 05 Jan 2026 20:18:42 +0800 Zhang Yi <yi.zhang@huaweicloud.com> wrote --- > > Hi Li, > > > > On 12/24/2025 11:29 AM, Li Chen wrote: > > > ext4_fc_track_inode() can return without sleeping when > > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for > > > ei->i_data_sem was done unconditionally before the wait loop, which can > > > WARN in call paths that hold i_data_sem even though we never block. Move > > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path, > > > right before schedule(). > > > > > > Signed-off-by: Li Chen <me@linux.beauty> > > > > Thank you for the fix patch! However, the solution does not seem to fix > > the issue. IIUC, the root cause of this issue is the following race > > condition (show only one case), and it may cause a real ABBA dead lock > > issue. > > > > ext4_map_blocks() > > hold i_data_sem // <- A > > ext4_mb_new_blocks() > > ext4_dirty_inode() > > ext4_fc_commit() > > ext4_fc_perform_commit() > > set EXT4_STATE_FC_COMMITTING <-B > > ext4_fc_write_inode_data() > > ext4_map_blocks() > > hold i_data_sem // <- A > > ext4_fc_track_inode() > > wait EXT4_STATE_FC_COMMITTING <- B > > jbd2_fc_end_commit() > > ext4_fc_cleanup() > > clear EXT4_STATE_FC_COMMITTING() > > > > Postponing the lockdep assertion to the point where sleeping is actually > > necessary does not resolve this deadlock issue, it merely masks the > > problem, right? > > > > I currently don't quite understand why only ext4_fc_track_inode() needs > > to wait for the inode being fast committed to be completed, instead of > > adding it to the FC_Q_STAGING list like other tracking operations. It seems that the inode metadata of the tracked inode was not recorded during the __track_inode(), so the inode metadata committed at commit time reflects real-time data. However, the current ext4_fc_perform_commit() lacks concurrency control, allowing other processes to simultaneously initiate new handles that modify the inode metadata while the previous metadata is being fast committed. Therefore, to prevent recording newly changed inode metadata during the old commit phase, the ext4_fc_track_inode() function must wait for the ongoing commit process to complete before modifying. > > So > > now I don't have a good idea to fix this problem either. Perhaps we > > need to rethink the necessity of this waiting, or find a way to avoid > > acquiring i_data_sem during fast commit. Ha, the solution seems to have already been listed in the TODOs in fast_commit.c. Change ext4_fc_commit() to lookup logical to physical mapping using extent status tree. This would get rid of the need to call ext4_fc_track_inode() before acquiring i_data_sem. To do that we would need to ensure that modified extents from the extent status tree are not evicted from memory. Alternatively, recording the mapped range of tracking might also be feasible. Thanks, Yi. > > Thanks a lot for your kind review! I'll provide feedback tomorrow. > > Regards, > Li >
Hi Zhang, Thanks a lot for your detailed review! ---- On Wed, 07 Jan 2026 10:00:23 +0800 Zhang Yi <yi.zhang@huaweicloud.com> wrote --- > On 1/6/2026 8:18 PM, Li Chen wrote: > > Hi Zhang Yi, > > > > ---- On Mon, 05 Jan 2026 20:18:42 +0800 Zhang Yi <yi.zhang@huaweicloud.com> wrote --- > > > Hi Li, > > > > > > On 12/24/2025 11:29 AM, Li Chen wrote: > > > > ext4_fc_track_inode() can return without sleeping when > > > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for > > > > ei->i_data_sem was done unconditionally before the wait loop, which can > > > > WARN in call paths that hold i_data_sem even though we never block. Move > > > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path, > > > > right before schedule(). > > > > > > > > Signed-off-by: Li Chen <me@linux.beauty> > > > > > > Thank you for the fix patch! However, the solution does not seem to fix > > > the issue. IIUC, the root cause of this issue is the following race > > > condition (show only one case), and it may cause a real ABBA dead lock > > > issue. > > > > > > ext4_map_blocks() > > > hold i_data_sem // <- A > > > ext4_mb_new_blocks() > > > ext4_dirty_inode() > > > ext4_fc_commit() > > > ext4_fc_perform_commit() > > > set EXT4_STATE_FC_COMMITTING <-B > > > ext4_fc_write_inode_data() > > > ext4_map_blocks() > > > hold i_data_sem // <- A > > > ext4_fc_track_inode() > > > wait EXT4_STATE_FC_COMMITTING <- B > > > jbd2_fc_end_commit() > > > ext4_fc_cleanup() > > > clear EXT4_STATE_FC_COMMITTING() > > > I think the ABBA reasoning is plausible: if a caller violates the ordering contract and enters ext4_fc_track_inode() while holding i_data_sem, and the recheck still finds EXT4_STATE_FC_COMMITTING set (so we actually schedule()), then we can get A -> wait(B). If the commit task, while holding the inode in COMMITTING, still needs i_data_sem (e.g. via mapping/log writing), that gives B -> wait(A), forming a cycle. > > > Postponing the lockdep assertion to the point where sleeping is actually > > > necessary does not resolve this deadlock issue, it merely masks the > > > problem, right? > > > > > > I currently don't quite understand why only ext4_fc_track_inode() needs > > > to wait for the inode being fast committed to be completed, instead of > > > adding it to the FC_Q_STAGING list like other tracking operations. > > It seems that the inode metadata of the tracked inode was not recorded > during the __track_inode(), so the inode metadata committed at commit > time reflects real-time data. However, the current > ext4_fc_perform_commit() lacks concurrency control, allowing other > processes to simultaneously initiate new handles that modify the inode > metadata while the previous metadata is being fast committed. Therefore, > to prevent recording newly changed inode metadata during the old commit > phase, the ext4_fc_track_inode() function must wait for the ongoing > commit process to complete before modifying. > > > > So > > > now I don't have a good idea to fix this problem either. Perhaps we > > > need to rethink the necessity of this waiting, or find a way to avoid > > > acquiring i_data_sem during fast commit. > > Ha, the solution seems to have already been listed in the TODOs in > fast_commit.c. > > Change ext4_fc_commit() to lookup logical to physical mapping using extent > status tree. This would get rid of the need to call ext4_fc_track_inode() > before acquiring i_data_sem. To do that we would need to ensure that > modified extents from the extent status tree are not evicted from memory. > > Alternatively, recording the mapped range of tracking might also be > feasible. Thanks a lot for your insights! For the next revesion, I plan to follow the "Alternatively" way firstly: record the mapped ranges (and relvant inode metadata) at commit time in a snapshot, when journal updates are locked/handles are drained, and then consume only the snapshot during log writing. This avoids doing logical-to-physical mapping (and thus avoids taking i_data_sem) in the log writing phase, and removes the need for ext4_fc_track_inode() to wait for EXT4_STATE_FC_COMMITTING. I did not pick the extent status tree approach because it would require additional work to guarantee the needed mappings are resident and not evicted under memory pressure, which seems like a larger correctness surface(Please correct me if I'm wrong). If you believe the extent stats tree approach is better, please let me know, and I will do my best to implement it. Thanks again for the guidance. I'll post an RFC v3 later. Regards, Li
On 1/7/2026 10:30 PM, Li Chen wrote: > Hi Zhang, > > Thanks a lot for your detailed review! > > ---- On Wed, 07 Jan 2026 10:00:23 +0800 Zhang Yi <yi.zhang@huaweicloud.com> wrote --- > > On 1/6/2026 8:18 PM, Li Chen wrote: > > > Hi Zhang Yi, > > > > > > ---- On Mon, 05 Jan 2026 20:18:42 +0800 Zhang Yi <yi.zhang@huaweicloud.com> wrote --- > > > > Hi Li, > > > > > > > > On 12/24/2025 11:29 AM, Li Chen wrote: > > > > > ext4_fc_track_inode() can return without sleeping when > > > > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for > > > > > ei->i_data_sem was done unconditionally before the wait loop, which can > > > > > WARN in call paths that hold i_data_sem even though we never block. Move > > > > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path, > > > > > right before schedule(). > > > > > > > > > > Signed-off-by: Li Chen <me@linux.beauty> > > > > > > > > Thank you for the fix patch! However, the solution does not seem to fix > > > > the issue. IIUC, the root cause of this issue is the following race > > > > condition (show only one case), and it may cause a real ABBA dead lock > > > > issue. > > > > > > > > ext4_map_blocks() > > > > hold i_data_sem // <- A > > > > ext4_mb_new_blocks() > > > > ext4_dirty_inode() > > > > ext4_fc_commit() > > > > ext4_fc_perform_commit() > > > > set EXT4_STATE_FC_COMMITTING <-B > > > > ext4_fc_write_inode_data() > > > > ext4_map_blocks() > > > > hold i_data_sem // <- A > > > > ext4_fc_track_inode() > > > > wait EXT4_STATE_FC_COMMITTING <- B > > > > jbd2_fc_end_commit() > > > > ext4_fc_cleanup() > > > > clear EXT4_STATE_FC_COMMITTING() > > > > > > I think the ABBA reasoning is plausible: if a caller violates the ordering > contract and enters ext4_fc_track_inode() while holding i_data_sem, and the > recheck still finds EXT4_STATE_FC_COMMITTING set (so we actually schedule()), > then we can get A -> wait(B). If the commit task, while holding the inode > in COMMITTING, still needs i_data_sem (e.g. via mapping/log writing), that > gives B -> wait(A), forming a cycle. > > > > > Postponing the lockdep assertion to the point where sleeping is actually > > > > necessary does not resolve this deadlock issue, it merely masks the > > > > problem, right? > > > > > > > > I currently don't quite understand why only ext4_fc_track_inode() needs > > > > to wait for the inode being fast committed to be completed, instead of > > > > adding it to the FC_Q_STAGING list like other tracking operations. > > > > It seems that the inode metadata of the tracked inode was not recorded > > during the __track_inode(), so the inode metadata committed at commit > > time reflects real-time data. However, the current > > ext4_fc_perform_commit() lacks concurrency control, allowing other > > processes to simultaneously initiate new handles that modify the inode > > metadata while the previous metadata is being fast committed. Therefore, > > to prevent recording newly changed inode metadata during the old commit > > phase, the ext4_fc_track_inode() function must wait for the ongoing > > commit process to complete before modifying. > > > > > > So > > > > now I don't have a good idea to fix this problem either. Perhaps we > > > > need to rethink the necessity of this waiting, or find a way to avoid > > > > acquiring i_data_sem during fast commit. > > > > Ha, the solution seems to have already been listed in the TODOs in > > fast_commit.c. > > > > Change ext4_fc_commit() to lookup logical to physical mapping using extent > > status tree. This would get rid of the need to call ext4_fc_track_inode() > > before acquiring i_data_sem. To do that we would need to ensure that > > modified extents from the extent status tree are not evicted from memory. > > > > Alternatively, recording the mapped range of tracking might also be > > feasible. > > Thanks a lot for your insights! > > For the next revesion, I plan to follow the "Alternatively" way firstly: > record the mapped ranges (and relvant inode metadata) at commit time in a > snapshot, when journal updates are locked/handles are drained, and then > consume only the snapshot during log writing. This avoids doing > logical-to-physical mapping (and thus avoids taking i_data_sem) in the log > writing phase, and removes the need for ext4_fc_track_inode() to wait for > EXT4_STATE_FC_COMMITTING. > > I did not pick the extent status tree approach because it would require > additional work to guarantee the needed mappings are resident and not > evicted under memory pressure, which seems like a larger correctness > surface(Please correct me if I'm wrong). If you believe the extent stats tree > approach is better, please let me know, and I will do my best to implement it. > > Thanks again for the guidance. I'll post an RFC v3 later. Yes, in my opinion, I also prefer the solution of recording the mapped ranges, as it should not incur too much overhead. Please give it a try. Cheers, Yi. > > Regards, > Li >
© 2016 - 2026 Red Hat, Inc.