[RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep

Li Chen posted 2 patches 1 month, 2 weeks ago
[RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep
Posted by Li Chen 1 month, 2 weeks ago
ext4_fc_track_inode() can return without sleeping when
EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
ei->i_data_sem was done unconditionally before the wait loop, which can
WARN in call paths that hold i_data_sem even though we never block. Move
lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
right before schedule().

Signed-off-by: Li Chen <me@linux.beauty>
---
 fs/ext4/fast_commit.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index d0926967d086..b0c458082997 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -566,13 +566,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 	if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
 		return;
 
-	/*
-	 * If we come here, we may sleep while waiting for the inode to
-	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
-	 * the commit path needs to grab the lock while committing the inode.
-	 */
-	lockdep_assert_not_held(&ei->i_data_sem);
-
 	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
 #if (BITS_PER_LONG < 64)
 		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
@@ -586,8 +579,16 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 				   EXT4_STATE_FC_COMMITTING);
 #endif
 		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
-		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+			/*
+			 * We might sleep while waiting for the inode to commit.
+			 * We shouldn't be holding i_data_sem when we go to sleep
+			 * since the commit path may grab it while committing this
+			 * inode.
+			 */
+			lockdep_assert_not_held(&ei->i_data_sem);
 			schedule();
+		}
 		finish_wait(wq, &wait.wq_entry);
 	}
 
-- 
2.52.0
Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep
Posted by Zhang Yi 1 month ago
Hi Li,

On 12/24/2025 11:29 AM, Li Chen wrote:
> ext4_fc_track_inode() can return without sleeping when
> EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
> ei->i_data_sem was done unconditionally before the wait loop, which can
> WARN in call paths that hold i_data_sem even though we never block. Move
> lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
> right before schedule().
> 
> Signed-off-by: Li Chen <me@linux.beauty>

Thank you for the fix patch! However, the solution does not seem to fix
the issue. IIUC, the root cause of this issue is the following race
condition (show only one case), and it may cause a real ABBA dead lock
issue.

ext4_map_blocks()
 hold i_data_sem // <- A
 ext4_mb_new_blocks()
  ext4_dirty_inode()
                                 ext4_fc_commit()
                                  ext4_fc_perform_commit()
                                   set EXT4_STATE_FC_COMMITTING  <-B
                                   ext4_fc_write_inode_data()
                                   ext4_map_blocks()
                                    hold i_data_sem  // <- A
   ext4_fc_track_inode()
    wait EXT4_STATE_FC_COMMITTING  <- B
                                  jbd2_fc_end_commit()
                                   ext4_fc_cleanup()
                                    clear EXT4_STATE_FC_COMMITTING()

Postponing the lockdep assertion to the point where sleeping is actually
necessary does not resolve this deadlock issue, it merely masks the
problem, right?

I currently don't quite understand why only ext4_fc_track_inode() needs
to wait for the inode being fast committed to be completed, instead of
adding it to the FC_Q_STAGING list like other tracking operations. So
now I don't have a good idea to fix this problem either.  Perhaps we
need to rethink the necessity of this waiting, or find a way to avoid
acquiring i_data_sem during fast commit.

Thanks,
Yi.

> ---
>  fs/ext4/fast_commit.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index d0926967d086..b0c458082997 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -566,13 +566,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
>  	if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
>  		return;
>  
> -	/*
> -	 * If we come here, we may sleep while waiting for the inode to
> -	 * commit. We shouldn't be holding i_data_sem when we go to sleep since
> -	 * the commit path needs to grab the lock while committing the inode.
> -	 */
> -	lockdep_assert_not_held(&ei->i_data_sem);
> -
>  	while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
>  #if (BITS_PER_LONG < 64)
>  		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> @@ -586,8 +579,16 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
>  				   EXT4_STATE_FC_COMMITTING);
>  #endif
>  		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> -		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
> +		if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> +			/*
> +			 * We might sleep while waiting for the inode to commit.
> +			 * We shouldn't be holding i_data_sem when we go to sleep
> +			 * since the commit path may grab it while committing this
> +			 * inode.
> +			 */
> +			lockdep_assert_not_held(&ei->i_data_sem);
>  			schedule();
> +		}
>  		finish_wait(wq, &wait.wq_entry);
>  	}
>
Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep
Posted by Li Chen 1 month ago
Hi Zhang,

Thanks a lot for your comments!

 ---- On Mon, 05 Jan 2026 20:18:42 +0800  Zhang Yi <yi.zhang@huaweicloud.com> wrote --- 
 > Hi Li,
 > 
 > On 12/24/2025 11:29 AM, Li Chen wrote:
 > > ext4_fc_track_inode() can return without sleeping when
 > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
 > > ei->i_data_sem was done unconditionally before the wait loop, which can
 > > WARN in call paths that hold i_data_sem even though we never block. Move
 > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
 > > right before schedule().
 > > 
 > > Signed-off-by: Li Chen <me@linux.beauty>
 > 
 > Thank you for the fix patch! However, the solution does not seem to fix
 > the issue. IIUC, the root cause of this issue is the following race
 > condition (show only one case), and it may cause a real ABBA dead lock
 > issue.
 > 
 > ext4_map_blocks()
 >  hold i_data_sem // <- A
 >  ext4_mb_new_blocks()
 >   ext4_dirty_inode()
 >                                  ext4_fc_commit()
 >                                   ext4_fc_perform_commit()
 >                                    set EXT4_STATE_FC_COMMITTING  <-B
 >                                    ext4_fc_write_inode_data()
 >                                    ext4_map_blocks()
 >                                     hold i_data_sem  // <- A
 >    ext4_fc_track_inode()
 >     wait EXT4_STATE_FC_COMMITTING  <- B
 >                                   jbd2_fc_end_commit()
 >                                    ext4_fc_cleanup()
 >                                     clear EXT4_STATE_FC_COMMITTING()
 > 
 > Postponing the lockdep assertion to the point where sleeping is actually
 > necessary does not resolve this deadlock issue, it merely masks the
 > problem, right?
 
I agree. Moving lockdep_assert_not_held(&ei->i_data_sem) closer to the
schedule() site can reduce spurious warnings (since the wait-bit pattern
rechecks the bit after prepare_to_wait()), but it does not remove the
underlying deadlock risk if we ever end up sleeping there, althoughI still 
haven't been able to reproduce this ABBA issue.

 > I currently don't quite understand why only ext4_fc_track_inode() needs
 > to wait for the inode being fast committed to be completed, instead of
 > adding it to the FC_Q_STAGING list like other tracking operations. So
 > now I don't have a good idea to fix this problem either.  Perhaps we
 > need to rethink the necessity of this waiting, or find a way to avoid
 > acquiring i_data_sem during fast commit.
 > 
 > Thanks,
 > Yi.
 > 
 > > ---
 > >  fs/ext4/fast_commit.c | 17 +++++++++--------
 > >  1 file changed, 9 insertions(+), 8 deletions(-)
 > > 
 > > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
 > > index d0926967d086..b0c458082997 100644
 > > --- a/fs/ext4/fast_commit.c
 > > +++ b/fs/ext4/fast_commit.c
 > > @@ -566,13 +566,6 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 > >      if (ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE))
 > >          return;
 > >  
 > > -    /*
 > > -     * If we come here, we may sleep while waiting for the inode to
 > > -     * commit. We shouldn't be holding i_data_sem when we go to sleep since
 > > -     * the commit path needs to grab the lock while committing the inode.
 > > -     */
 > > -    lockdep_assert_not_held(&ei->i_data_sem);
 > > -
 > >      while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
 > >  #if (BITS_PER_LONG < 64)
 > >          DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
 > > @@ -586,8 +579,16 @@ void ext4_fc_track_inode(handle_t *handle, struct inode *inode)
 > >                     EXT4_STATE_FC_COMMITTING);
 > >  #endif
 > >          prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 > > -        if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
 > > +        if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
 > > +            /*
 > > +             * We might sleep while waiting for the inode to commit.
 > > +             * We shouldn't be holding i_data_sem when we go to sleep
 > > +             * since the commit path may grab it while committing this
 > > +             * inode.
 > > +             */
 > > +            lockdep_assert_not_held(&ei->i_data_sem);
 > >              schedule();
 > > +        }
 > >          finish_wait(wq, &wait.wq_entry);
 > >      }
 > >  
 > 
 > 

Regards,
Li​
Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep
Posted by Li Chen 1 month ago
Hi Zhang Yi,

 ---- On Mon, 05 Jan 2026 20:18:42 +0800  Zhang Yi <yi.zhang@huaweicloud.com> wrote --- 
 > Hi Li,
 > 
 > On 12/24/2025 11:29 AM, Li Chen wrote:
 > > ext4_fc_track_inode() can return without sleeping when
 > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
 > > ei->i_data_sem was done unconditionally before the wait loop, which can
 > > WARN in call paths that hold i_data_sem even though we never block. Move
 > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
 > > right before schedule().
 > > 
 > > Signed-off-by: Li Chen <me@linux.beauty>
 > 
 > Thank you for the fix patch! However, the solution does not seem to fix
 > the issue. IIUC, the root cause of this issue is the following race
 > condition (show only one case), and it may cause a real ABBA dead lock
 > issue.
 > 
 > ext4_map_blocks()
 >  hold i_data_sem // <- A
 >  ext4_mb_new_blocks()
 >   ext4_dirty_inode()
 >                                  ext4_fc_commit()
 >                                   ext4_fc_perform_commit()
 >                                    set EXT4_STATE_FC_COMMITTING  <-B
 >                                    ext4_fc_write_inode_data()
 >                                    ext4_map_blocks()
 >                                     hold i_data_sem  // <- A
 >    ext4_fc_track_inode()
 >     wait EXT4_STATE_FC_COMMITTING  <- B
 >                                   jbd2_fc_end_commit()
 >                                    ext4_fc_cleanup()
 >                                     clear EXT4_STATE_FC_COMMITTING()
 > 
 > Postponing the lockdep assertion to the point where sleeping is actually
 > necessary does not resolve this deadlock issue, it merely masks the
 > problem, right?
 > 
 > I currently don't quite understand why only ext4_fc_track_inode() needs
 > to wait for the inode being fast committed to be completed, instead of
 > adding it to the FC_Q_STAGING list like other tracking operations. So
 > now I don't have a good idea to fix this problem either.  Perhaps we
 > need to rethink the necessity of this waiting, or find a way to avoid
 > acquiring i_data_sem during fast commit.

Thanks a lot for your kind review! I'll provide feedback tomorrow.

Regards,
Li​
Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep
Posted by Zhang Yi 1 month ago
On 1/6/2026 8:18 PM, Li Chen wrote:
> Hi Zhang Yi,
> 
>  ---- On Mon, 05 Jan 2026 20:18:42 +0800  Zhang Yi <yi.zhang@huaweicloud.com> wrote --- 
>  > Hi Li,
>  > 
>  > On 12/24/2025 11:29 AM, Li Chen wrote:
>  > > ext4_fc_track_inode() can return without sleeping when
>  > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
>  > > ei->i_data_sem was done unconditionally before the wait loop, which can
>  > > WARN in call paths that hold i_data_sem even though we never block. Move
>  > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
>  > > right before schedule().
>  > > 
>  > > Signed-off-by: Li Chen <me@linux.beauty>
>  > 
>  > Thank you for the fix patch! However, the solution does not seem to fix
>  > the issue. IIUC, the root cause of this issue is the following race
>  > condition (show only one case), and it may cause a real ABBA dead lock
>  > issue.
>  > 
>  > ext4_map_blocks()
>  >  hold i_data_sem // <- A
>  >  ext4_mb_new_blocks()
>  >   ext4_dirty_inode()
>  >                                  ext4_fc_commit()
>  >                                   ext4_fc_perform_commit()
>  >                                    set EXT4_STATE_FC_COMMITTING  <-B
>  >                                    ext4_fc_write_inode_data()
>  >                                    ext4_map_blocks()
>  >                                     hold i_data_sem  // <- A
>  >    ext4_fc_track_inode()
>  >     wait EXT4_STATE_FC_COMMITTING  <- B
>  >                                   jbd2_fc_end_commit()
>  >                                    ext4_fc_cleanup()
>  >                                     clear EXT4_STATE_FC_COMMITTING()
>  > 
>  > Postponing the lockdep assertion to the point where sleeping is actually
>  > necessary does not resolve this deadlock issue, it merely masks the
>  > problem, right?
>  > 
>  > I currently don't quite understand why only ext4_fc_track_inode() needs
>  > to wait for the inode being fast committed to be completed, instead of
>  > adding it to the FC_Q_STAGING list like other tracking operations.

It seems that the inode metadata of the tracked inode was not recorded
during the __track_inode(), so the inode metadata committed at commit
time reflects real-time data. However, the current
ext4_fc_perform_commit() lacks concurrency control, allowing other
processes to simultaneously initiate new handles that modify the inode
metadata while the previous metadata is being fast committed. Therefore,
to prevent recording newly changed inode metadata during the old commit
phase, the ext4_fc_track_inode() function must wait for the ongoing
commit process to complete before modifying.

>  > So
>  > now I don't have a good idea to fix this problem either.  Perhaps we
>  > need to rethink the necessity of this waiting, or find a way to avoid
>  > acquiring i_data_sem during fast commit.

Ha, the solution seems to have already been listed in the TODOs in
fast_commit.c.

  Change ext4_fc_commit() to lookup logical to physical mapping using extent
  status tree. This would get rid of the need to call ext4_fc_track_inode()
  before acquiring i_data_sem. To do that we would need to ensure that
  modified extents from the extent status tree are not evicted from memory.

Alternatively, recording the mapped range of tracking might also be
feasible.

Thanks,
Yi.

> 
> Thanks a lot for your kind review! I'll provide feedback tomorrow.
> 
> Regards,
> Li​
> 

Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep
Posted by Li Chen 1 month ago
Hi Zhang,

Thanks a lot for your detailed review!

 ---- On Wed, 07 Jan 2026 10:00:23 +0800  Zhang Yi <yi.zhang@huaweicloud.com> wrote --- 
 > On 1/6/2026 8:18 PM, Li Chen wrote:
 > > Hi Zhang Yi,
 > > 
 > >  ---- On Mon, 05 Jan 2026 20:18:42 +0800  Zhang Yi <yi.zhang@huaweicloud.com> wrote --- 
 > >  > Hi Li,
 > >  > 
 > >  > On 12/24/2025 11:29 AM, Li Chen wrote:
 > >  > > ext4_fc_track_inode() can return without sleeping when
 > >  > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
 > >  > > ei->i_data_sem was done unconditionally before the wait loop, which can
 > >  > > WARN in call paths that hold i_data_sem even though we never block. Move
 > >  > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
 > >  > > right before schedule().
 > >  > > 
 > >  > > Signed-off-by: Li Chen <me@linux.beauty>
 > >  > 
 > >  > Thank you for the fix patch! However, the solution does not seem to fix
 > >  > the issue. IIUC, the root cause of this issue is the following race
 > >  > condition (show only one case), and it may cause a real ABBA dead lock
 > >  > issue.
 > >  > 
 > >  > ext4_map_blocks()
 > >  >  hold i_data_sem // <- A
 > >  >  ext4_mb_new_blocks()
 > >  >   ext4_dirty_inode()
 > >  >                                  ext4_fc_commit()
 > >  >                                   ext4_fc_perform_commit()
 > >  >                                    set EXT4_STATE_FC_COMMITTING  <-B
 > >  >                                    ext4_fc_write_inode_data()
 > >  >                                    ext4_map_blocks()
 > >  >                                     hold i_data_sem  // <- A
 > >  >    ext4_fc_track_inode()
 > >  >     wait EXT4_STATE_FC_COMMITTING  <- B
 > >  >                                   jbd2_fc_end_commit()
 > >  >                                    ext4_fc_cleanup()
 > >  >                                     clear EXT4_STATE_FC_COMMITTING()
 > >  > 
 
I think the ABBA reasoning is plausible: if a caller violates the ordering
contract and enters ext4_fc_track_inode() while holding i_data_sem, and the
recheck still finds EXT4_STATE_FC_COMMITTING set (so we actually schedule()),
then we can get A -> wait(B). If the commit task, while holding the inode
in COMMITTING, still needs i_data_sem (e.g. via mapping/log writing), that
gives B -> wait(A), forming a cycle.

 > >  > Postponing the lockdep assertion to the point where sleeping is actually
 > >  > necessary does not resolve this deadlock issue, it merely masks the
 > >  > problem, right?
 > >  > 
 > >  > I currently don't quite understand why only ext4_fc_track_inode() needs
 > >  > to wait for the inode being fast committed to be completed, instead of
 > >  > adding it to the FC_Q_STAGING list like other tracking operations.
 > 
 > It seems that the inode metadata of the tracked inode was not recorded
 > during the __track_inode(), so the inode metadata committed at commit
 > time reflects real-time data. However, the current
 > ext4_fc_perform_commit() lacks concurrency control, allowing other
 > processes to simultaneously initiate new handles that modify the inode
 > metadata while the previous metadata is being fast committed. Therefore,
 > to prevent recording newly changed inode metadata during the old commit
 > phase, the ext4_fc_track_inode() function must wait for the ongoing
 > commit process to complete before modifying.
 > 
 > >  > So
 > >  > now I don't have a good idea to fix this problem either.  Perhaps we
 > >  > need to rethink the necessity of this waiting, or find a way to avoid
 > >  > acquiring i_data_sem during fast commit.
 > 
 > Ha, the solution seems to have already been listed in the TODOs in
 > fast_commit.c.
 > 
 >   Change ext4_fc_commit() to lookup logical to physical mapping using extent
 >   status tree. This would get rid of the need to call ext4_fc_track_inode()
 >   before acquiring i_data_sem. To do that we would need to ensure that
 >   modified extents from the extent status tree are not evicted from memory.
 > 
 > Alternatively, recording the mapped range of tracking might also be
 > feasible.

Thanks a lot for your insights!

For the next revesion, I plan to follow the "Alternatively" way firstly:
record the mapped ranges (and relvant inode metadata) at commit time in a
snapshot, when journal updates are locked/handles are drained, and then
consume only the snapshot during log writing. This avoids doing
logical-to-physical mapping (and thus avoids taking i_data_sem) in the log
writing phase, and removes the need for ext4_fc_track_inode() to wait for
EXT4_STATE_FC_COMMITTING.

I did not pick the extent status tree approach because it would require
additional work to guarantee the needed mappings are resident and not
evicted under memory pressure, which seems like a larger correctness
surface(Please correct me if I'm wrong). If you believe the extent stats tree
approach is better, please let me know, and I will do my best to implement it.

Thanks again for the guidance. I'll post an RFC v3 later.

Regards,
Li​
Re: [RFC v3 1/2] ext4: fast_commit: assert i_data_sem only before sleep
Posted by Zhang Yi 1 month ago
On 1/7/2026 10:30 PM, Li Chen wrote:
> Hi Zhang,
> 
> Thanks a lot for your detailed review!
> 
>  ---- On Wed, 07 Jan 2026 10:00:23 +0800  Zhang Yi <yi.zhang@huaweicloud.com> wrote --- 
>  > On 1/6/2026 8:18 PM, Li Chen wrote:
>  > > Hi Zhang Yi,
>  > > 
>  > >  ---- On Mon, 05 Jan 2026 20:18:42 +0800  Zhang Yi <yi.zhang@huaweicloud.com> wrote --- 
>  > >  > Hi Li,
>  > >  > 
>  > >  > On 12/24/2025 11:29 AM, Li Chen wrote:
>  > >  > > ext4_fc_track_inode() can return without sleeping when
>  > >  > > EXT4_STATE_FC_COMMITTING is already clear. The lockdep assertion for
>  > >  > > ei->i_data_sem was done unconditionally before the wait loop, which can
>  > >  > > WARN in call paths that hold i_data_sem even though we never block. Move
>  > >  > > lockdep_assert_not_held(&ei->i_data_sem) into the actual sleep path,
>  > >  > > right before schedule().
>  > >  > > 
>  > >  > > Signed-off-by: Li Chen <me@linux.beauty>
>  > >  > 
>  > >  > Thank you for the fix patch! However, the solution does not seem to fix
>  > >  > the issue. IIUC, the root cause of this issue is the following race
>  > >  > condition (show only one case), and it may cause a real ABBA dead lock
>  > >  > issue.
>  > >  > 
>  > >  > ext4_map_blocks()
>  > >  >  hold i_data_sem // <- A
>  > >  >  ext4_mb_new_blocks()
>  > >  >   ext4_dirty_inode()
>  > >  >                                  ext4_fc_commit()
>  > >  >                                   ext4_fc_perform_commit()
>  > >  >                                    set EXT4_STATE_FC_COMMITTING  <-B
>  > >  >                                    ext4_fc_write_inode_data()
>  > >  >                                    ext4_map_blocks()
>  > >  >                                     hold i_data_sem  // <- A
>  > >  >    ext4_fc_track_inode()
>  > >  >     wait EXT4_STATE_FC_COMMITTING  <- B
>  > >  >                                   jbd2_fc_end_commit()
>  > >  >                                    ext4_fc_cleanup()
>  > >  >                                     clear EXT4_STATE_FC_COMMITTING()
>  > >  > 
>  
> I think the ABBA reasoning is plausible: if a caller violates the ordering
> contract and enters ext4_fc_track_inode() while holding i_data_sem, and the
> recheck still finds EXT4_STATE_FC_COMMITTING set (so we actually schedule()),
> then we can get A -> wait(B). If the commit task, while holding the inode
> in COMMITTING, still needs i_data_sem (e.g. via mapping/log writing), that
> gives B -> wait(A), forming a cycle.
> 
>  > >  > Postponing the lockdep assertion to the point where sleeping is actually
>  > >  > necessary does not resolve this deadlock issue, it merely masks the
>  > >  > problem, right?
>  > >  > 
>  > >  > I currently don't quite understand why only ext4_fc_track_inode() needs
>  > >  > to wait for the inode being fast committed to be completed, instead of
>  > >  > adding it to the FC_Q_STAGING list like other tracking operations.
>  > 
>  > It seems that the inode metadata of the tracked inode was not recorded
>  > during the __track_inode(), so the inode metadata committed at commit
>  > time reflects real-time data. However, the current
>  > ext4_fc_perform_commit() lacks concurrency control, allowing other
>  > processes to simultaneously initiate new handles that modify the inode
>  > metadata while the previous metadata is being fast committed. Therefore,
>  > to prevent recording newly changed inode metadata during the old commit
>  > phase, the ext4_fc_track_inode() function must wait for the ongoing
>  > commit process to complete before modifying.
>  > 
>  > >  > So
>  > >  > now I don't have a good idea to fix this problem either.  Perhaps we
>  > >  > need to rethink the necessity of this waiting, or find a way to avoid
>  > >  > acquiring i_data_sem during fast commit.
>  > 
>  > Ha, the solution seems to have already been listed in the TODOs in
>  > fast_commit.c.
>  > 
>  >   Change ext4_fc_commit() to lookup logical to physical mapping using extent
>  >   status tree. This would get rid of the need to call ext4_fc_track_inode()
>  >   before acquiring i_data_sem. To do that we would need to ensure that
>  >   modified extents from the extent status tree are not evicted from memory.
>  > 
>  > Alternatively, recording the mapped range of tracking might also be
>  > feasible.
> 
> Thanks a lot for your insights!
> 
> For the next revesion, I plan to follow the "Alternatively" way firstly:
> record the mapped ranges (and relvant inode metadata) at commit time in a
> snapshot, when journal updates are locked/handles are drained, and then
> consume only the snapshot during log writing. This avoids doing
> logical-to-physical mapping (and thus avoids taking i_data_sem) in the log
> writing phase, and removes the need for ext4_fc_track_inode() to wait for
> EXT4_STATE_FC_COMMITTING.
> 
> I did not pick the extent status tree approach because it would require
> additional work to guarantee the needed mappings are resident and not
> evicted under memory pressure, which seems like a larger correctness
> surface(Please correct me if I'm wrong). If you believe the extent stats tree
> approach is better, please let me know, and I will do my best to implement it.
> 
> Thanks again for the guidance. I'll post an RFC v3 later.

Yes, in my opinion, I also prefer the solution of recording the mapped ranges,
as it should not incur too much overhead. Please give it a try.

Cheers,
Yi.

> 
> Regards,
> Li​
>