[PATCH v3 2/3] ext4: avoid journaling sb update on error if journal is destroying

Ojaswin Mujoo posted 3 patches 9 months, 1 week ago
There is a newer version of this series
[PATCH v3 2/3] ext4: avoid journaling sb update on error if journal is destroying
Posted by Ojaswin Mujoo 9 months, 1 week ago
Presently we always BUG_ON if trying to start a transaction on a journal marked
with JBD2_UNMOUNT, since this should never happen. However, while ltp running
stress tests, it was observed that in case of some error handling paths, it is
possible for update_super_work to start a transaction after the journal is
destroyed eg:

(umount)
ext4_kill_sb
  kill_block_super
    generic_shutdown_super
      sync_filesystem /* commits all txns */
      evict_inodes
        /* might start a new txn */
      ext4_put_super
	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
        jbd2_journal_destroy
          journal_kill_thread
            journal->j_flags |= JBD2_UNMOUNT;
          jbd2_journal_commit_transaction
            jbd2_journal_get_descriptor_buffer
              jbd2_journal_bmap
                ext4_journal_bmap
                  ext4_map_blocks
                    ...
                    ext4_inode_error
                      ext4_handle_error
                        schedule_work(&sbi->s_sb_upd_work)

                                               /* work queue kicks in */
                                               update_super_work
                                                 jbd2_journal_start
                                                   start_this_handle
                                                     BUG_ON(journal->j_flags &
                                                            JBD2_UNMOUNT)

Hence, introduce a new mount flag to indicate journal is destroying and only do
a journaled (and deferred) update of sb if this flag is not set. Otherwise, just
fallback to an un-journaled commit.

Further, in the journal destroy path, we have the following sequence:

  1. Set mount flag indicating journal is destroying
  2. force a commit and wait for it
  3. flush pending sb updates

This sequence is important as it ensures that, after this point, there is no sb
update that might be journaled so it is safe to update the sb outside the
journal. (To avoid race discussed in 2d01ddc86606)

Also, we don't need a similar check in ext4_grp_locked_error since it is only
called from mballoc and AFAICT it would be always valid to schedule work here.

Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h      |  1 +
 fs/ext4/ext4_jbd2.h | 15 +++++++++++++++
 fs/ext4/super.c     | 12 ++++++------
 3 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2b7d781bfcad..ee54b8510791 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1179,6 +1179,7 @@ struct ext4_inode_info {
 #define	EXT4_ERROR_FS			0x0002	/* Errors detected */
 #define	EXT4_ORPHAN_FS			0x0004	/* Orphans being recovered */
 #define EXT4_FC_REPLAY			0x0020	/* Fast commit replay ongoing */
+#define EXT4_JOURNAL_DESTORY		0x0040	/* Journal is in process of destroying */
 
 /*
  * Misc. filesystem flags
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 9b3c9df02a39..f054db9fa9e1 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -437,6 +437,21 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
 {
 	int err = 0;
 
+	/*
+	 * At this point only two things can be operating on the journal.
+	 * JBD2 thread performing transaction commit and s_sb_upd_work
+	 * issuing sb update through the journal. Once we set
+	 * EXT4_JOURNAL_DESTROY, new ext4_handle_error() calls will not
+	 * queue s_sb_upd_work and ext4_force_commit() makes sure any
+	 * ext4_handle_error() calls from the running transaction commit are
+	 * finished. Hence no new s_sb_upd_work can be queued after we
+	 * flush it here.
+	 */
+	ext4_set_mount_flag(sbi->s_sb, EXT4_JOURNAL_DESTORY);
+
+	ext4_force_commit(sbi->s_sb);
+	flush_work(&sbi->s_sb_upd_work);
+
 	err = jbd2_journal_destroy(journal);
 	sbi->s_journal = NULL;
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8ad664d47806..54ef0cc566a4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -704,9 +704,13 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
 		 * In case the fs should keep running, we need to writeout
 		 * superblock through the journal. Due to lock ordering
 		 * constraints, it may not be safe to do it right here so we
-		 * defer superblock flushing to a workqueue.
+		 * defer superblock flushing to a workqueue. We just need to be
+		 * careful when the journal is already shutting down. If we get
+		 * here in that case, just update the sb directly as the last
+		 * transaction won't commit anyway.
 		 */
-		if (continue_fs && journal)
+		if (continue_fs && journal &&
+		    !ext4_test_mount_flag(sb, EXT4_JOURNAL_DESTORY))
 			schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
 		else
 			ext4_commit_super(sb);
@@ -4957,8 +4961,6 @@ static int ext4_load_and_init_journal(struct super_block *sb,
 	return 0;
 
 out:
-	/* flush s_sb_upd_work before destroying the journal. */
-	flush_work(&sbi->s_sb_upd_work);
 	ext4_journal_destroy(sbi, sbi->s_journal);
 	return -EINVAL;
 }
@@ -5648,8 +5650,6 @@ failed_mount8: __maybe_unused
 	sbi->s_ea_block_cache = NULL;
 
 	if (sbi->s_journal) {
-		/* flush s_sb_upd_work before journal destroy. */
-		flush_work(&sbi->s_sb_upd_work);
 		ext4_journal_destroy(sbi, sbi->s_journal);
 	}
 failed_mount3a:
-- 
2.48.1
Re: [PATCH v3 2/3] ext4: avoid journaling sb update on error if journal is destroying
Posted by Jan Kara 9 months, 1 week ago
On Fri 14-03-25 13:14:10, Ojaswin Mujoo wrote:
> Presently we always BUG_ON if trying to start a transaction on a journal marked
> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> stress tests, it was observed that in case of some error handling paths, it is
> possible for update_super_work to start a transaction after the journal is
> destroyed eg:
> 
> (umount)
> ext4_kill_sb
>   kill_block_super
>     generic_shutdown_super
>       sync_filesystem /* commits all txns */
>       evict_inodes
>         /* might start a new txn */
>       ext4_put_super
> 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>         jbd2_journal_destroy
>           journal_kill_thread
>             journal->j_flags |= JBD2_UNMOUNT;
>           jbd2_journal_commit_transaction
>             jbd2_journal_get_descriptor_buffer
>               jbd2_journal_bmap
>                 ext4_journal_bmap
>                   ext4_map_blocks
>                     ...
>                     ext4_inode_error
>                       ext4_handle_error
>                         schedule_work(&sbi->s_sb_upd_work)
> 
>                                                /* work queue kicks in */
>                                                update_super_work
>                                                  jbd2_journal_start
>                                                    start_this_handle
>                                                      BUG_ON(journal->j_flags &
>                                                             JBD2_UNMOUNT)
> 
> Hence, introduce a new mount flag to indicate journal is destroying and only do
> a journaled (and deferred) update of sb if this flag is not set. Otherwise, just
> fallback to an un-journaled commit.
> 
> Further, in the journal destroy path, we have the following sequence:
> 
>   1. Set mount flag indicating journal is destroying
>   2. force a commit and wait for it
>   3. flush pending sb updates
> 
> This sequence is important as it ensures that, after this point, there is no sb
> update that might be journaled so it is safe to update the sb outside the
> journal. (To avoid race discussed in 2d01ddc86606)
> 
> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> called from mballoc and AFAICT it would be always valid to schedule work here.
> 
> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Mostly looks good. Couple of small comments below:

> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1179,6 +1179,7 @@ struct ext4_inode_info {
>  #define	EXT4_ERROR_FS			0x0002	/* Errors detected */
>  #define	EXT4_ORPHAN_FS			0x0004	/* Orphans being recovered */
>  #define EXT4_FC_REPLAY			0x0020	/* Fast commit replay ongoing */
> +#define EXT4_JOURNAL_DESTORY		0x0040	/* Journal is in process of destroying */

This should be defined as part of the following enum:

/*
 * run-time mount flags
 */
enum {
        EXT4_MF_MNTDIR_SAMPLED,
        EXT4_MF_FC_INELIGIBLE   /* Fast commit ineligible */
};

Also you have a typo in the flag name. I guess it should be
EXT4_MF_JOURNAL_DESTROY.

> @@ -4957,8 +4961,6 @@ static int ext4_load_and_init_journal(struct super_block *sb,
>  	return 0;
>  
>  out:
> -	/* flush s_sb_upd_work before destroying the journal. */
> -	flush_work(&sbi->s_sb_upd_work);
>  	ext4_journal_destroy(sbi, sbi->s_journal);
>  	return -EINVAL;
>  }
> @@ -5648,8 +5650,6 @@ failed_mount8: __maybe_unused
>  	sbi->s_ea_block_cache = NULL;
>  
>  	if (sbi->s_journal) {
> -		/* flush s_sb_upd_work before journal destroy. */
> -		flush_work(&sbi->s_sb_upd_work);
>  		ext4_journal_destroy(sbi, sbi->s_journal);
>  	}
>  failed_mount3a:

These are good. I would also move the flush_work() in ext4_put_super()
into else branch of:

	if (sbi->s_journal)

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v3 2/3] ext4: avoid journaling sb update on error if journal is destroying
Posted by Ojaswin Mujoo 9 months, 1 week ago
On Fri, Mar 14, 2025 at 11:10:40AM +0100, Jan Kara wrote:
> On Fri 14-03-25 13:14:10, Ojaswin Mujoo wrote:
> > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > stress tests, it was observed that in case of some error handling paths, it is
> > possible for update_super_work to start a transaction after the journal is
> > destroyed eg:
> > 
> > (umount)
> > ext4_kill_sb
> >   kill_block_super
> >     generic_shutdown_super
> >       sync_filesystem /* commits all txns */
> >       evict_inodes
> >         /* might start a new txn */
> >       ext4_put_super
> > 	flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >         jbd2_journal_destroy
> >           journal_kill_thread
> >             journal->j_flags |= JBD2_UNMOUNT;
> >           jbd2_journal_commit_transaction
> >             jbd2_journal_get_descriptor_buffer
> >               jbd2_journal_bmap
> >                 ext4_journal_bmap
> >                   ext4_map_blocks
> >                     ...
> >                     ext4_inode_error
> >                       ext4_handle_error
> >                         schedule_work(&sbi->s_sb_upd_work)
> > 
> >                                                /* work queue kicks in */
> >                                                update_super_work
> >                                                  jbd2_journal_start
> >                                                    start_this_handle
> >                                                      BUG_ON(journal->j_flags &
> >                                                             JBD2_UNMOUNT)
> > 
> > Hence, introduce a new mount flag to indicate journal is destroying and only do
> > a journaled (and deferred) update of sb if this flag is not set. Otherwise, just
> > fallback to an un-journaled commit.
> > 
> > Further, in the journal destroy path, we have the following sequence:
> > 
> >   1. Set mount flag indicating journal is destroying
> >   2. force a commit and wait for it
> >   3. flush pending sb updates
> > 
> > This sequence is important as it ensures that, after this point, there is no sb
> > update that might be journaled so it is safe to update the sb outside the
> > journal. (To avoid race discussed in 2d01ddc86606)
> > 
> > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > called from mballoc and AFAICT it would be always valid to schedule work here.
> > 
> > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > Reported-by: Mahesh Kumar <maheshkumar657g@gmail.com>
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> Mostly looks good. Couple of small comments below:
> 
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1179,6 +1179,7 @@ struct ext4_inode_info {
> >  #define	EXT4_ERROR_FS			0x0002	/* Errors detected */
> >  #define	EXT4_ORPHAN_FS			0x0004	/* Orphans being recovered */
> >  #define EXT4_FC_REPLAY			0x0020	/* Fast commit replay ongoing */
> > +#define EXT4_JOURNAL_DESTORY		0x0040	/* Journal is in process of destroying */
> 
> This should be defined as part of the following enum:
> 
> /*
>  * run-time mount flags
>  */
> enum {
>         EXT4_MF_MNTDIR_SAMPLED,
>         EXT4_MF_FC_INELIGIBLE   /* Fast commit ineligible */
> };

Right, I'll fix that
> 
> Also you have a typo in the flag name. I guess it should be
> EXT4_MF_JOURNAL_DESTROY.

I ended up making the same typo as before :) 
> 
> > @@ -4957,8 +4961,6 @@ static int ext4_load_and_init_journal(struct super_block *sb,
> >  	return 0;
> >  
> >  out:
> > -	/* flush s_sb_upd_work before destroying the journal. */
> > -	flush_work(&sbi->s_sb_upd_work);
> >  	ext4_journal_destroy(sbi, sbi->s_journal);
> >  	return -EINVAL;
> >  }
> > @@ -5648,8 +5650,6 @@ failed_mount8: __maybe_unused
> >  	sbi->s_ea_block_cache = NULL;
> >  
> >  	if (sbi->s_journal) {
> > -		/* flush s_sb_upd_work before journal destroy. */
> > -		flush_work(&sbi->s_sb_upd_work);
> >  		ext4_journal_destroy(sbi, sbi->s_journal);
> >  	}
> >  failed_mount3a:
> 
> These are good. I would also move the flush_work() in ext4_put_super()
> into else branch of:
> 
> 	if (sbi->s_journal)

Right I thought of that but then let it be for simplicity. But sure I'll
add this in the next revision.

Thanks for the review Jan.
> 
> 								Honza
> 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR