Following process makes ext4 load stale buffer heads from last failed
mounting in a new mounting operation:
mount_bdev
ext4_fill_super
| ext4_load_and_init_journal
| ext4_load_journal
| jbd2_journal_load
| load_superblock
| journal_get_superblock
| set_buffer_verified(bh) // buffer head is verified
| jbd2_journal_recover // failed caused by EIO
| goto failed_mount3a // skip 'sb->s_root' initialization
deactivate_locked_super
kill_block_super
generic_shutdown_super
if (sb->s_root)
// false, skip ext4_put_super->invalidate_bdev->
// invalidate_mapping_pages->mapping_evict_folio->
// filemap_release_folio->try_to_free_buffers, which
// cannot drop buffer head.
blkdev_put
blkdev_put_whole
if (atomic_dec_and_test(&bdev->bd_openers))
// false, systemd-udev happens to open the device. Then
// blkdev_flush_mapping->kill_bdev->truncate_inode_pages->
// truncate_inode_folio->truncate_cleanup_folio->
// folio_invalidate->block_invalidate_folio->
// filemap_release_folio->try_to_free_buffers will be skipped,
// dropping buffer head is missed again.
Second mount:
ext4_fill_super
ext4_load_and_init_journal
ext4_load_journal
ext4_get_journal
jbd2_journal_init_inode
journal_init_common
bh = getblk_unmovable
bh = __find_get_block // Found stale bh in last failed mounting
journal->j_sb_buffer = bh
jbd2_journal_load
load_superblock
journal_get_superblock
if (buffer_verified(bh))
// true, skip journal->j_format_version = 2, value is 0
jbd2_journal_recover
do_one_pass
next_log_block += count_tags(journal, bh)
// According to journal_tag_bytes(), 'tag_bytes' calculating is
// affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3()
// returns false because 'j->j_format_version >= 2' is not true,
// then we get wrong next_log_block. The do_one_pass may exit
// early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'.
The filesystem is corrupted here, journal is partially replayed, and
new journal sequence number actually is already used by last mounting.
The invalidate_bdev() can drop all buffer heads even racing with bare
reading block device(eg. systemd-udev), so we can fix it by invalidating
bdev in error handling path in __ext4_fill_super().
Fetch a reproducer in [Link].
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171
Fixes: 25ed6e8a54df ("jbd2: enable journal clients to enable v2 checksumming")
Cc: stable@vger.kernel.org # v3.5
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
---
fs/ext4/super.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 88f7b8a88c76..7e990637bc48 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1126,6 +1126,12 @@ static void ext4_blkdev_remove(struct ext4_sb_info *sbi)
struct block_device *bdev;
bdev = sbi->s_journal_bdev;
if (bdev) {
+ /*
+ * Invalidate the journal device's buffers. We don't want them
+ * floating about in memory - the physical journal device may
+ * hotswapped, and it breaks the `ro-after' testing code.
+ */
+ invalidate_bdev(bdev);
ext4_blkdev_put(bdev);
sbi->s_journal_bdev = NULL;
}
@@ -1271,14 +1277,8 @@ static void ext4_put_super(struct super_block *sb)
sync_blockdev(sb->s_bdev);
invalidate_bdev(sb->s_bdev);
- if (sbi->s_journal_bdev && sbi->s_journal_bdev != sb->s_bdev) {
- /*
- * Invalidate the journal device's buffers. We don't want them
- * floating about in memory - the physical journal device may
- * hotswapped, and it breaks the `ro-after' testing code.
- */
+ if (sbi->s_journal_bdev) {
sync_blockdev(sbi->s_journal_bdev);
- invalidate_bdev(sbi->s_journal_bdev);
ext4_blkdev_remove(sbi);
}
@@ -5610,6 +5610,7 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
brelse(sbi->s_sbh);
ext4_blkdev_remove(sbi);
out_fail:
+ invalidate_bdev(sb->s_bdev);
sb->s_fs_info = NULL;
return err ? err : ret;
}
--
2.31.1
On Mon 13-03-23 21:20:17, Zhihao Cheng wrote:
> Following process makes ext4 load stale buffer heads from last failed
> mounting in a new mounting operation:
> mount_bdev
> ext4_fill_super
> | ext4_load_and_init_journal
> | ext4_load_journal
> | jbd2_journal_load
> | load_superblock
> | journal_get_superblock
> | set_buffer_verified(bh) // buffer head is verified
> | jbd2_journal_recover // failed caused by EIO
> | goto failed_mount3a // skip 'sb->s_root' initialization
> deactivate_locked_super
> kill_block_super
> generic_shutdown_super
> if (sb->s_root)
> // false, skip ext4_put_super->invalidate_bdev->
> // invalidate_mapping_pages->mapping_evict_folio->
> // filemap_release_folio->try_to_free_buffers, which
> // cannot drop buffer head.
> blkdev_put
> blkdev_put_whole
> if (atomic_dec_and_test(&bdev->bd_openers))
> // false, systemd-udev happens to open the device. Then
> // blkdev_flush_mapping->kill_bdev->truncate_inode_pages->
> // truncate_inode_folio->truncate_cleanup_folio->
> // folio_invalidate->block_invalidate_folio->
> // filemap_release_folio->try_to_free_buffers will be skipped,
> // dropping buffer head is missed again.
>
> Second mount:
> ext4_fill_super
> ext4_load_and_init_journal
> ext4_load_journal
> ext4_get_journal
> jbd2_journal_init_inode
> journal_init_common
> bh = getblk_unmovable
> bh = __find_get_block // Found stale bh in last failed mounting
> journal->j_sb_buffer = bh
> jbd2_journal_load
> load_superblock
> journal_get_superblock
> if (buffer_verified(bh))
> // true, skip journal->j_format_version = 2, value is 0
> jbd2_journal_recover
> do_one_pass
> next_log_block += count_tags(journal, bh)
> // According to journal_tag_bytes(), 'tag_bytes' calculating is
> // affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3()
> // returns false because 'j->j_format_version >= 2' is not true,
> // then we get wrong next_log_block. The do_one_pass may exit
> // early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'.
>
> The filesystem is corrupted here, journal is partially replayed, and
> new journal sequence number actually is already used by last mounting.
>
> The invalidate_bdev() can drop all buffer heads even racing with bare
> reading block device(eg. systemd-udev), so we can fix it by invalidating
> bdev in error handling path in __ext4_fill_super().
>
> Fetch a reproducer in [Link].
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171
> Fixes: 25ed6e8a54df ("jbd2: enable journal clients to enable v2 checksumming")
> Cc: stable@vger.kernel.org # v3.5
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
...
> @@ -1271,14 +1277,8 @@ static void ext4_put_super(struct super_block *sb)
>
> sync_blockdev(sb->s_bdev);
> invalidate_bdev(sb->s_bdev);
> - if (sbi->s_journal_bdev && sbi->s_journal_bdev != sb->s_bdev) {
> - /*
> - * Invalidate the journal device's buffers. We don't want them
> - * floating about in memory - the physical journal device may
> - * hotswapped, and it breaks the `ro-after' testing code.
> - */
> + if (sbi->s_journal_bdev) {
> sync_blockdev(sbi->s_journal_bdev);
> - invalidate_bdev(sbi->s_journal_bdev);
> ext4_blkdev_remove(sbi);
> }
Hum, but this will invalidate bhs only if journal is stored on a block
device. If journal is in the inode (the common case), we won't invalidate
anything (sbi->s_journal_bdev is NULL) and the same problem can happen?
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
在 2023/3/14 19:33, Jan Kara 写道:
Hi Jan,
>
>> @@ -1271,14 +1277,8 @@ static void ext4_put_super(struct super_block *sb)
>>
>> sync_blockdev(sb->s_bdev);
>> invalidate_bdev(sb->s_bdev);
For journal in the inode case, journal bhs come from block device, which
means buffers will be dropped after this line
'invalidate_bdev(sb->s_bdev)' being executed.
>> - if (sbi->s_journal_bdev && sbi->s_journal_bdev != sb->s_bdev) {
>> - /*
>> - * Invalidate the journal device's buffers. We don't want them
>> - * floating about in memory - the physical journal device may
>> - * hotswapped, and it breaks the `ro-after' testing code.
>> - */
>> + if (sbi->s_journal_bdev) {
>> sync_blockdev(sbi->s_journal_bdev);
>> - invalidate_bdev(sbi->s_journal_bdev);
>> ext4_blkdev_remove(sbi);
>> }
> > Hum, but this will invalidate bhs only if journal is stored on a block
> device. If journal is in the inode (the common case), we won't invalidate
> anything (sbi->s_journal_bdev is NULL) and the same problem can happen?
>
On Tue 14-03-23 20:01:46, Zhihao Cheng wrote:
> 在 2023/3/14 19:33, Jan Kara 写道:
> Hi Jan,
>
> >
> > > @@ -1271,14 +1277,8 @@ static void ext4_put_super(struct super_block *sb)
> > > sync_blockdev(sb->s_bdev);
> > > invalidate_bdev(sb->s_bdev);
>
> For journal in the inode case, journal bhs come from block device, which
> means buffers will be dropped after this line 'invalidate_bdev(sb->s_bdev)'
> being executed.
Right, I've missed that. But then why do you remove the sbi->s_journal_bdev
!= sb->s_bdev condition below?
> > > - if (sbi->s_journal_bdev && sbi->s_journal_bdev != sb->s_bdev) {
> > > - /*
> > > - * Invalidate the journal device's buffers. We don't want them
> > > - * floating about in memory - the physical journal device may
> > > - * hotswapped, and it breaks the `ro-after' testing code.
> > > - */
> > > + if (sbi->s_journal_bdev) {
> > > sync_blockdev(sbi->s_journal_bdev);
> > > - invalidate_bdev(sbi->s_journal_bdev);
> > > ext4_blkdev_remove(sbi);
> > > }
> > > Hum, but this will invalidate bhs only if journal is stored on a block
> > device. If journal is in the inode (the common case), we won't invalidate
> > anything (sbi->s_journal_bdev is NULL) and the same problem can happen?
> >
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
> On Tue 14-03-23 20:01:46, Zhihao Cheng wrote: >> 在 2023/3/14 19:33, Jan Kara 写道: >> Hi Jan, >> >>> >>>> @@ -1271,14 +1277,8 @@ static void ext4_put_super(struct super_block *sb) >>>> sync_blockdev(sb->s_bdev); >>>> invalidate_bdev(sb->s_bdev); >> >> For journal in the inode case, journal bhs come from block device, which >> means buffers will be dropped after this line 'invalidate_bdev(sb->s_bdev)' >> being executed. > > Right, I've missed that. But then why do you remove the sbi->s_journal_bdev > != sb->s_bdev condition below? > I think 'sbi->s_journal_bdev != sb->s_bdev' always becomes true if sbi->s_journal_bdev exists. mount_bdev fmode_t mode = FMODE_READ | FMODE_EXCL bdev_a = blkdev_get_by_path(dev_name, mode, fs_type) mount_bdev->ext4_fill_super->ext4_load_and_init_journal->ext4_load_journal->ext4_get_dev_journal: bdev_b = ext4_blkdev_get(j_dev, sb) bdev_b = blkdev_get_by_dev(dev, FMODE_READ|FMODE_WRITE|FMODE_EXCL, sb) EXT4_SB(sb)->s_journal_bdev = bdev_b bdev_a cannot be bdev_b, because bd_prepare_to_claim() makes sure the same block device cannot be openned twice with mode 'FMODE_EXCL'.
On Tue 14-03-23 20:31:43, Zhihao Cheng wrote: > > On Tue 14-03-23 20:01:46, Zhihao Cheng wrote: > > > 在 2023/3/14 19:33, Jan Kara 写道: > > > Hi Jan, > > > > > > > > > > > > @@ -1271,14 +1277,8 @@ static void ext4_put_super(struct super_block *sb) > > > > > sync_blockdev(sb->s_bdev); > > > > > invalidate_bdev(sb->s_bdev); > > > > > > For journal in the inode case, journal bhs come from block device, which > > > means buffers will be dropped after this line 'invalidate_bdev(sb->s_bdev)' > > > being executed. > > > > Right, I've missed that. But then why do you remove the sbi->s_journal_bdev > > != sb->s_bdev condition below? > > > > I think 'sbi->s_journal_bdev != sb->s_bdev' always becomes true if > sbi->s_journal_bdev exists. OK, fair point. But please move this cleanup into a separate commit with this justification. Thanks! Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
© 2016 - 2026 Red Hat, Inc.