From nobody Sat Jun 13 03:30:33 2026 Received: from sender4-op-o15.zoho.com (sender4-op-o15.zoho.com [136.143.188.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 598F454652; Mon, 11 May 2026 08:44:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=pass smtp.client-ip=136.143.188.15 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489069; cv=pass; b=gyqGNqdwUJmmUW9b5QNqT+GBhStYuO88e2nkQT+utGGMkCtroPvtEEcaKrnCoOvF01BtaKu6U2TRB28L42MBbGOLuA4dYtFoP2TWI0A3s0SpiPcCUjR6dHUYb3QqxS2SlyFwN9udHxg5dbPi0Y0NwumsLYoaPHvirnq+JxlS/7U= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489069; c=relaxed/simple; bh=QV7kIdsBd8OAM1Fqyg5OdjxU2Gv4x5QGOU+ajcd8jkc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=RVn7QNFjf33zUP6JzPMe2VbWtZTQI9xTowr4GLGkLLYaEkKvydOaS0265o3g3EakrioB7F14dt9krzJoHs643oCfg5HKF6jmrE0r0GGk8Dv2ee1rp7/CHhXUCZB6K3hEticcD697y3xLd7OJHkMdrfAjddL2ZAjiG90O1zpUb/w= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty; spf=pass smtp.mailfrom=linux.beauty; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b=DU0whATC; arc=pass smtp.client-ip=136.143.188.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.beauty Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b="DU0whATC" ARC-Seal: i=1; a=rsa-sha256; t=1778489016; cv=none; d=zohomail.com; s=zohoarc; b=cCia7XWnaJ0V+dNPrfzvm2DB9tIF5sd4X2kQIa+3iLVREug/IQ0Ai84cEWUKOtmOSP9Uw4Z/PdGDGWXvWnmvK/NuwU5KYtHYwlumMPqgpfL52USou92tb6r+HU6oJR5ttvq4oa+1t7N7zfh0d3G0bnGJdtXGBFG8HZFZQ3AxCHU= ARC-Message-Signature: i=1; 
a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1778489016; h=Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:MIME-Version:Message-ID:References:Subject:Subject:To:To:Message-Id:Reply-To; bh=yyS5Gbf2MMEaU6It3q8Bt3GloHNnxWhLYasDvwZ3AnU=; b=V0Ksxq+WSjIQ5v63T/knX5Di9o11viuANY12+vPPW+l3r4i6IXWcLV05NpGdhLJIsE+JmEGJtPcMEHbNM0CgJ6KSuhv13alNrOxjkbwCflsub2TRZRoTrNCOICxUJqdu94+EpdFFlJKzyru8XNFr10PovCXo1RZj2H9lM/KdVVg= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass header.i=linux.beauty; spf=pass smtp.mailfrom=me@linux.beauty; dmarc=pass header.from= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1778489016; s=zmail; d=linux.beauty; i=me@linux.beauty; h=From:From:To:To:Cc:Cc:Subject:Subject:Date:Date:Message-ID:In-Reply-To:References:MIME-Version:Content-Transfer-Encoding:Message-Id:Reply-To; bh=yyS5Gbf2MMEaU6It3q8Bt3GloHNnxWhLYasDvwZ3AnU=; b=DU0whATCc+Dc4iUYe3tAFLy54YNEDnHC0XT9tNwE+rt6q8gU45BCVIg70m1+h/hy NJy4oal6SUwRmrGKKpreWtPAlqwrWk+7vLf7W6frsdOfBTWwGXxV+Gl8OkQ82dI9siw 8YTRRHFRz51YDgGbsQMooVR4fZcKN8soVKeZBQ0I= Received: by mx.zohomail.com with SMTPS id 17784890138221020.5425210051374; Mon, 11 May 2026 01:43:33 -0700 (PDT) From: Li Chen To: Zhang Yi , Theodore Ts'o , Andreas Dilger , Baokun Li , Jan Kara , Ojaswin Mujoo , "Ritesh Harjani (IBM)" , Zhang Yi , linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org Subject: [RFC v7 1/7] ext4: fast commit: snapshot inode state before writing log Date: Mon, 11 May 2026 16:42:56 +0800 Message-ID: <20260511084304.1559557-2-me@linux.beauty> X-Mailer: git-send-email 2.53.0 In-Reply-To: <20260511084304.1559557-1-me@linux.beauty> References: <20260511084304.1559557-1-me@linux.beauty> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ZohoMailClient: External 
Content-Type: text/plain; charset="utf-8"

Fast commit writes inode metadata and data range updates after unlocking
journal updates. New handles can start at that point, so the log writing
path must not look at live inode state.

Add a commit-time per-inode snapshot and populate it while journal
updates are locked and existing handles are drained. Store the snapshot
behind ext4_inode_info->i_fc_snap so ext4_inode_info only grows by one
pointer. The snapshot contains a copy of the on-disk inode plus the data
range records needed for fast commit TLVs.

Snapshotting runs under jbd2_journal_lock_updates(). Avoid triggering
I/O there by using ext4_get_inode_loc_noio() and falling back to full
commit if the inode table block is not present or not uptodate.

Log writing then only serializes the snapshot, so it no longer needs to
call ext4_map_blocks() and take i_data_sem under s_fc_lock. The snapshot
is installed and freed under s_fc_lock and is released from fast commit
cleanup and inode eviction.

Signed-off-by: Li Chen
---
Changes in v7:
- Drop the stale i_fc_wait initialization after rebasing onto the new
  linux-next base.

Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Fix the inode debug print format after rebasing.

 fs/ext4/ext4.h        |  22 ++-
 fs/ext4/fast_commit.c | 331 +++++++++++++++++++++++++++++++++++-------
 fs/ext4/inode.c       |  51 +++++++
 3 files changed, 352 insertions(+), 52 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 94283a991e5c..e01d00dbc077 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1023,6 +1023,7 @@ enum {
 	I_DATA_SEM_EA
 };
=20
+struct ext4_fc_inode_snap;
=20
 /*
  * fourth extended file system inode data in memory
@@ -1079,6 +1080,22 @@ struct ext4_inode_info {
 	/* End of lblk range that needs to be committed in this fast commit */
 	ext4_lblk_t i_fc_lblk_len;
=20
+	/*
+	 * Commit-time fast commit snapshots.
+	 *
+	 * i_fc_snap is installed and freed under sbi->s_fc_lock.
The fast + * commit log writing path reads the snapshot under sbi->s_fc_lock while + * serializing fast commit TLVs. + * + * The snapshot lifetime is bounded by EXT4_STATE_FC_COMMITTING and the + * corresponding cleanup / eviction paths. + * + * i_fc_snap points to per-inode snapshot data for fast commit: + * - a raw inode snapshot for EXT4_FC_TAG_INODE + * - data range records for EXT4_FC_TAG_{ADD,DEL}_RANGE + */ + struct ext4_fc_inode_snap *i_fc_snap; + spinlock_t i_raw_lock; /* protects updates to the raw inode */ =20 /* @@ -3080,8 +3097,9 @@ extern int ext4_file_getattr(struct mnt_idmap *, con= st struct path *, struct kstat *, u32, unsigned int); extern void ext4_dirty_inode(struct inode *, int); extern int ext4_change_inode_journal_flag(struct inode *, int); -extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); -extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino, +int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc); +int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc); +int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino, struct ext4_iloc *iloc); extern int ext4_inode_attach_jinode(struct inode *inode); extern int ext4_can_truncate(struct inode *inode); diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c index b3c22636251d..cd4eac4e7dcb 100644 --- a/fs/ext4/fast_commit.c +++ b/fs/ext4/fast_commit.c @@ -56,21 +56,23 @@ * deleted while it is being flushed. * [2] Flush data buffers to disk and clear "EXT4_STATE_FC_FLUSHING_DATA" * state. - * [3] Lock the journal by calling jbd2_journal_lock_updates. This ensures= that - * all the exsiting handles finish and no new handles can start. - * [4] Mark all the fast commit eligible inodes as undergoing fast commit - * by setting "EXT4_STATE_FC_COMMITTING" state. - * [5] Unlock the journal by calling jbd2_journal_unlock_updates. This all= ows + * [3] Lock the journal by calling jbd2_journal_lock_updates(). 
This ensur= es + * that all the existing handles finish and no new handles can start. + * [4] Mark all the fast commit eligible inodes as undergoing fast commit = by + * setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode st= ate + * needed for log writing. + * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This a= llows * starting of new handles. If new handles try to start an update on * any of the inodes that are being committed, ext4_fc_track_inode() * will block until those inodes have finished the fast commit. * [6] Commit all the directory entry updates in the fast commit space. - * [7] Commit all the changed inodes in the fast commit space and clear - * "EXT4_STATE_FC_COMMITTING" for these inodes. + * [7] Commit all the changed inodes in the fast commit space. * [8] Write tail tag (this tag ensures the atomicity, please read the fol= lowing * section for more details). + * [9] Clear "EXT4_STATE_FC_COMMITTING" and wake up waiters in + * ext4_fc_cleanup(). * - * All the inode updates must be enclosed within jbd2_jounrnal_start() + * All the inode updates must be enclosed within jbd2_journal_start() * and jbd2_journal_stop() similar to JBD2 journaling. 
* * Fast Commit Ineligibility @@ -200,6 +202,8 @@ static void ext4_end_buffer_io_sync(struct buffer_head = *bh, int uptodate) unlock_buffer(bh); } =20 +static void ext4_fc_free_inode_snap(struct inode *inode); + static inline void ext4_fc_reset_inode(struct inode *inode) { struct ext4_inode_info *ei =3D EXT4_I(inode); @@ -216,6 +220,7 @@ void ext4_fc_init_inode(struct inode *inode) ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING); INIT_LIST_HEAD(&ei->i_fc_list); INIT_LIST_HEAD(&ei->i_fc_dilist); + ei->i_fc_snap =3D NULL; } =20 static bool ext4_fc_disabled(struct super_block *sb) @@ -246,6 +251,7 @@ void ext4_fc_del(struct inode *inode) =20 alloc_ctx =3D ext4_fc_lock(inode->i_sb); if (list_empty(&ei->i_fc_list) && list_empty(&ei->i_fc_dilist)) { + ext4_fc_free_inode_snap(inode); ext4_fc_unlock(inode->i_sb, alloc_ctx); return; } @@ -287,6 +293,7 @@ void ext4_fc_del(struct inode *inode) } finish_wait(wq, &wait.wq_entry); } + ext4_fc_free_inode_snap(inode); list_del_init(&ei->i_fc_list); =20 /* @@ -829,6 +836,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block = *sb, u32 *crc, return true; } =20 +struct ext4_fc_range { + struct list_head list; + u16 tag; + ext4_lblk_t lblk; + ext4_lblk_t len; + ext4_fsblk_t pblk; + bool unwritten; +}; + +struct ext4_fc_inode_snap { + struct list_head data_list; + unsigned int inode_len; + u8 inode_buf[]; +}; + /* * Writes inode in the fast commit space under TLV with tag @tag. * Returns 0 on success, error on failure. 
@@ -836,21 +858,21 @@ static bool ext4_fc_add_dentry_tlv(struct super_block= *sb, u32 *crc, static int ext4_fc_write_inode(struct inode *inode, u32 *crc) { struct ext4_inode_info *ei =3D EXT4_I(inode); - int inode_len =3D EXT4_GOOD_OLD_INODE_SIZE; - int ret; - struct ext4_iloc iloc; + struct ext4_fc_inode_snap *snap =3D ei->i_fc_snap; struct ext4_fc_inode fc_inode; struct ext4_fc_tl tl; u8 *dst; + u8 *src; + int inode_len; + int ret; =20 - ret =3D ext4_get_inode_loc(inode, &iloc); - if (ret) - return ret; + if (!snap) + return -ECANCELED; =20 - if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA)) - inode_len =3D EXT4_INODE_SIZE(inode->i_sb); - else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) - inode_len +=3D ei->i_extra_isize; + src =3D snap->inode_buf; + inode_len =3D snap->inode_len; + if (!src || inode_len =3D=3D 0) + return -ECANCELED; =20 fc_inode.fc_ino =3D cpu_to_le32(inode->i_ino); tl.fc_tag =3D cpu_to_le16(EXT4_FC_TAG_INODE); @@ -866,10 +888,9 @@ static int ext4_fc_write_inode(struct inode *inode, u3= 2 *crc) dst +=3D EXT4_FC_TAG_BASE_LEN; memcpy(dst, &fc_inode, sizeof(fc_inode)); dst +=3D sizeof(fc_inode); - memcpy(dst, (u8 *)ext4_raw_inode(&iloc), inode_len); + memcpy(dst, src, inode_len); ret =3D 0; err: - brelse(iloc.bh); return ret; } =20 @@ -879,12 +900,74 @@ static int ext4_fc_write_inode(struct inode *inode, u= 32 *crc) */ static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc) { - ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size; struct ext4_inode_info *ei =3D EXT4_I(inode); - struct ext4_map_blocks map; + struct ext4_fc_inode_snap *snap =3D ei->i_fc_snap; struct ext4_fc_add_range fc_ext; struct ext4_fc_del_range lrange; struct ext4_extent *ex; + struct ext4_fc_range *range; + + if (!snap) + return -ECANCELED; + + list_for_each_entry(range, &snap->data_list, list) { + if (range->tag =3D=3D EXT4_FC_TAG_DEL_RANGE) { + lrange.fc_ino =3D cpu_to_le32(inode->i_ino); + lrange.fc_lblk =3D cpu_to_le32(range->lblk); + 
lrange.fc_len =3D cpu_to_le32(range->len); + if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE, + sizeof(lrange), (u8 *)&lrange, crc)) + return -ENOSPC; + continue; + } + + fc_ext.fc_ino =3D cpu_to_le32(inode->i_ino); + ex =3D (struct ext4_extent *)&fc_ext.fc_ex; + ex->ee_block =3D cpu_to_le32(range->lblk); + ex->ee_len =3D cpu_to_le16(range->len); + ext4_ext_store_pblock(ex, range->pblk); + if (range->unwritten) + ext4_ext_mark_unwritten(ex); + else + ext4_ext_mark_initialized(ex); + + if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE, + sizeof(fc_ext), (u8 *)&fc_ext, crc)) + return -ENOSPC; + } + + return 0; +} + +static void ext4_fc_free_ranges(struct list_head *head) +{ + struct ext4_fc_range *range, *range_n; + + list_for_each_entry_safe(range, range_n, head, list) { + list_del(&range->list); + kfree(range); + } +} + +static void ext4_fc_free_inode_snap(struct inode *inode) +{ + struct ext4_inode_info *ei =3D EXT4_I(inode); + struct ext4_fc_inode_snap *snap =3D ei->i_fc_snap; + + if (!snap) + return; + + ext4_fc_free_ranges(&snap->data_list); + kfree(snap); + ei->i_fc_snap =3D NULL; +} + +static int ext4_fc_snapshot_inode_data(struct inode *inode, + struct list_head *ranges) +{ + struct ext4_inode_info *ei =3D EXT4_I(inode); + ext4_lblk_t start_lblk, end_lblk, cur_lblk; + struct ext4_map_blocks map; int ret; =20 spin_lock(&ei->i_fc_lock); @@ -892,18 +975,21 @@ static int ext4_fc_write_inode_data(struct inode *ino= de, u32 *crc) spin_unlock(&ei->i_fc_lock); return 0; } - old_blk_size =3D ei->i_fc_lblk_start; - new_blk_size =3D ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1; + start_lblk =3D ei->i_fc_lblk_start; + end_lblk =3D ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1; ei->i_fc_lblk_len =3D 0; spin_unlock(&ei->i_fc_lock); =20 - cur_lblk_off =3D old_blk_size; - ext4_debug("will try writing %d to %d for inode %llu\n", - cur_lblk_off, new_blk_size, inode->i_ino); + cur_lblk =3D start_lblk; + ext4_debug("snapshot data ranges %u-%u for inode %llu\n", + 
start_lblk, end_lblk, + (unsigned long long)inode->i_ino); + + while (cur_lblk <=3D end_lblk) { + struct ext4_fc_range *range; =20 - while (cur_lblk_off <=3D new_blk_size) { - map.m_lblk =3D cur_lblk_off; - map.m_len =3D new_blk_size - cur_lblk_off + 1; + map.m_lblk =3D cur_lblk; + map.m_len =3D end_lblk - cur_lblk + 1; ret =3D ext4_map_blocks(NULL, inode, &map, EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE); @@ -911,17 +997,21 @@ static int ext4_fc_write_inode_data(struct inode *ino= de, u32 *crc) return -ECANCELED; =20 if (map.m_len =3D=3D 0) { - cur_lblk_off++; + cur_lblk++; continue; } =20 + range =3D kmalloc(sizeof(*range), GFP_NOFS); + if (!range) + return -ENOMEM; + + range->lblk =3D map.m_lblk; + range->len =3D map.m_len; + range->pblk =3D 0; + range->unwritten =3D false; + if (ret =3D=3D 0) { - lrange.fc_ino =3D cpu_to_le32(inode->i_ino); - lrange.fc_lblk =3D cpu_to_le32(map.m_lblk); - lrange.fc_len =3D cpu_to_le32(map.m_len); - if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE, - sizeof(lrange), (u8 *)&lrange, crc)) - return -ENOSPC; + range->tag =3D EXT4_FC_TAG_DEL_RANGE; } else { unsigned int max =3D (map.m_flags & EXT4_MAP_UNWRITTEN) ? 
EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN; @@ -929,26 +1019,67 @@ static int ext4_fc_write_inode_data(struct inode *in= ode, u32 *crc) /* Limit the number of blocks in one extent */ map.m_len =3D min(max, map.m_len); =20 - fc_ext.fc_ino =3D cpu_to_le32(inode->i_ino); - ex =3D (struct ext4_extent *)&fc_ext.fc_ex; - ex->ee_block =3D cpu_to_le32(map.m_lblk); - ex->ee_len =3D cpu_to_le16(map.m_len); - ext4_ext_store_pblock(ex, map.m_pblk); - if (map.m_flags & EXT4_MAP_UNWRITTEN) - ext4_ext_mark_unwritten(ex); - else - ext4_ext_mark_initialized(ex); - if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE, - sizeof(fc_ext), (u8 *)&fc_ext, crc)) - return -ENOSPC; + range->tag =3D EXT4_FC_TAG_ADD_RANGE; + range->len =3D map.m_len; + range->pblk =3D map.m_pblk; + range->unwritten =3D !!(map.m_flags & EXT4_MAP_UNWRITTEN); } =20 - cur_lblk_off +=3D map.m_len; + INIT_LIST_HEAD(&range->list); + list_add_tail(&range->list, ranges); + + cur_lblk +=3D map.m_len; } =20 return 0; } =20 +static int ext4_fc_snapshot_inode(struct inode *inode) +{ + struct ext4_inode_info *ei =3D EXT4_I(inode); + struct ext4_fc_inode_snap *snap; + int inode_len =3D EXT4_GOOD_OLD_INODE_SIZE; + struct ext4_iloc iloc; + LIST_HEAD(ranges); + int ret; + int alloc_ctx; + + ret =3D ext4_get_inode_loc_noio(inode, &iloc); + if (ret) + return ret; + + if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA)) + inode_len =3D EXT4_INODE_SIZE(inode->i_sb); + else if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) + inode_len +=3D ei->i_extra_isize; + + snap =3D kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS); + if (!snap) { + brelse(iloc.bh); + return -ENOMEM; + } + INIT_LIST_HEAD(&snap->data_list); + snap->inode_len =3D inode_len; + + memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len); + brelse(iloc.bh); + + ret =3D ext4_fc_snapshot_inode_data(inode, &ranges); + if (ret) { + kfree(snap); + ext4_fc_free_ranges(&ranges); + return ret; + } + + alloc_ctx =3D ext4_fc_lock(inode->i_sb); 
+ ext4_fc_free_inode_snap(inode); + ei->i_fc_snap =3D snap; + list_splice_tail_init(&ranges, &snap->data_list); + ext4_fc_unlock(inode->i_sb, alloc_ctx); + + return 0; +} + =20 /* Flushes data of all the inodes in the commit queue. */ static int ext4_fc_flush_data(journal_t *journal) @@ -999,6 +1130,11 @@ static int ext4_fc_commit_dentry_updates(journal_t *j= ournal, u32 *crc) */ if (list_empty(&fc_dentry->fcd_dilist)) continue; + /* + * For EXT4_FC_TAG_CREAT, fcd_dilist is linked on the created + * inode's i_fc_dilist list (kept singular), so we can recover the + * inode through it. + */ ei =3D list_first_entry(&fc_dentry->fcd_dilist, struct ext4_inode_info, i_fc_dilist); inode =3D &ei->vfs_inode; @@ -1023,6 +1159,88 @@ static int ext4_fc_commit_dentry_updates(journal_t *= journal, u32 *crc) return 0; } =20 +static int ext4_fc_snapshot_inodes(journal_t *journal) +{ + struct super_block *sb =3D journal->j_private; + struct ext4_sb_info *sbi =3D EXT4_SB(sb); + struct ext4_inode_info *iter; + struct ext4_fc_dentry_update *fc_dentry; + struct inode **inodes; + unsigned int nr_inodes =3D 0; + unsigned int i =3D 0; + int ret =3D 0; + int alloc_ctx; + + alloc_ctx =3D ext4_fc_lock(sb); + list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) + nr_inodes++; + + list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) { + struct ext4_inode_info *ei; + + if (fc_dentry->fcd_op !=3D EXT4_FC_TAG_CREAT) + continue; + if (list_empty(&fc_dentry->fcd_dilist)) + continue; + + /* See the comment in ext4_fc_commit_dentry_updates(). 
*/ + ei =3D list_first_entry(&fc_dentry->fcd_dilist, + struct ext4_inode_info, i_fc_dilist); + if (!list_empty(&ei->i_fc_list)) + continue; + + nr_inodes++; + } + ext4_fc_unlock(sb, alloc_ctx); + + if (!nr_inodes) + return 0; + + inodes =3D kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS); + if (!inodes) + return -ENOMEM; + + alloc_ctx =3D ext4_fc_lock(sb); + list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { + inodes[i] =3D igrab(&iter->vfs_inode); + if (inodes[i]) + i++; + } + + list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) { + struct ext4_inode_info *ei; + + if (fc_dentry->fcd_op !=3D EXT4_FC_TAG_CREAT) + continue; + if (list_empty(&fc_dentry->fcd_dilist)) + continue; + + /* See the comment in ext4_fc_commit_dentry_updates(). */ + ei =3D list_first_entry(&fc_dentry->fcd_dilist, + struct ext4_inode_info, i_fc_dilist); + if (!list_empty(&ei->i_fc_list)) + continue; + + inodes[i] =3D igrab(&ei->vfs_inode); + if (inodes[i]) + i++; + } + ext4_fc_unlock(sb, alloc_ctx); + + for (nr_inodes =3D 0; nr_inodes < i; nr_inodes++) { + ret =3D ext4_fc_snapshot_inode(inodes[nr_inodes]); + if (ret) + break; + } + + for (nr_inodes =3D 0; nr_inodes < i; nr_inodes++) { + if (inodes[nr_inodes]) + iput(inodes[nr_inodes]); + } + kvfree(inodes); + return ret; +} + static int ext4_fc_perform_commit(journal_t *journal) { struct super_block *sb =3D journal->j_private; @@ -1095,7 +1313,11 @@ static int ext4_fc_perform_commit(journal_t *journal) EXT4_STATE_FC_COMMITTING); } ext4_fc_unlock(sb, alloc_ctx); + + ret =3D ext4_fc_snapshot_inodes(journal); jbd2_journal_unlock_updates(journal); + if (ret) + return ret; =20 /* * Step 5: If file system device is different from journal device, @@ -1292,6 +1514,7 @@ static void ext4_fc_cleanup(journal_t *journal, int f= ull, tid_t tid) struct ext4_inode_info, i_fc_list); list_del_init(&ei->i_fc_list); + ext4_fc_free_inode_snap(&ei->vfs_inode); ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING); if 
(tid_geq(tid, ei->i_sync_tid)) { @@ -1327,6 +1550,14 @@ static void ext4_fc_cleanup(journal_t *journal, int = full, tid_t tid) struct ext4_fc_dentry_update, fcd_list); list_del_init(&fc_dentry->fcd_list); + if (fc_dentry->fcd_op =3D=3D EXT4_FC_TAG_CREAT && + !list_empty(&fc_dentry->fcd_dilist)) { + /* See the comment in ext4_fc_commit_dentry_updates(). */ + ei =3D list_first_entry(&fc_dentry->fcd_dilist, + struct ext4_inode_info, + i_fc_dilist); + ext4_fc_free_inode_snap(&ei->vfs_inode); + } list_del_init(&fc_dentry->fcd_dilist); =20 release_dentry_name_snapshot(&fc_dentry->fcd_name); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c2c2d6ac7f3d..4678612f82e8 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5025,6 +5025,57 @@ int ext4_get_inode_loc(struct inode *inode, struct e= xt4_iloc *iloc) return ret; } =20 +/* + * ext4_get_inode_loc_noio() is a best-effort variant of ext4_get_inode_lo= c(). + * It looks up the inode table block in the buffer cache and returns -EAGA= IN if + * the block is not present or not uptodate, without starting any I/O. + */ +int ext4_get_inode_loc_noio(struct inode *inode, struct ext4_iloc *iloc) +{ + struct super_block *sb =3D inode->i_sb; + struct ext4_group_desc *gdp; + struct buffer_head *bh; + ext4_fsblk_t block; + int inodes_per_block, inode_offset; + unsigned long ino =3D inode->i_ino; + + iloc->bh =3D NULL; + if (ino < EXT4_ROOT_INO || + ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count)) + return -EFSCORRUPTED; + + iloc->block_group =3D (ino - 1) / EXT4_INODES_PER_GROUP(sb); + gdp =3D ext4_get_group_desc(sb, iloc->block_group, NULL); + if (!gdp) + return -EIO; + + /* Figure out the offset within the block group inode table. 
*/ + inodes_per_block =3D EXT4_SB(sb)->s_inodes_per_block; + inode_offset =3D ((ino - 1) % EXT4_INODES_PER_GROUP(sb)); + iloc->offset =3D (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb); + + block =3D ext4_inode_table(sb, gdp); + if (block <=3D le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block) || + block >=3D ext4_blocks_count(EXT4_SB(sb)->s_es)) { + ext4_error(sb, + "Invalid inode table block %llu in block_group %u", + block, iloc->block_group); + return -EFSCORRUPTED; + } + block +=3D inode_offset / inodes_per_block; + + bh =3D sb_find_get_block(sb, block); + if (!bh) + return -EAGAIN; + if (!ext4_buffer_uptodate(bh)) { + brelse(bh); + return -EAGAIN; + } + + iloc->bh =3D bh; + return 0; +} + =20 int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino, struct ext4_iloc *iloc) --=20 2.53.0 From nobody Sat Jun 13 03:30:33 2026 Received: from sender4-op-o15.zoho.com (sender4-op-o15.zoho.com [136.143.188.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D4C213A257A; Mon, 11 May 2026 08:45:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=pass smtp.client-ip=136.143.188.15 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489104; cv=pass; b=GUjnEQqTJ4QojQ4DAPptGMyOrmc4AfwHjXQeKdfE9yX8blxIwu1kPoSdyL6StzG7/Wdup03JsrkRZlRWHBvjQbIWZzB1i31s75wRUhTLmgY1Zb2ge7KVrZqAZpBoVOrK687K0GKEM/Mz5TGM+roeQmxFnjH9Ule/XZBZl+T6V80= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489104; c=relaxed/simple; bh=ChWYJXEq+lFSmys907g8j0n4wRymG24Yq7Xh+QFIbEs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=IxLe/w6WAE/GTd1reXD5cbKcyDyzcb/mlt/Hs9fONGkTnZrHGv+BQEHBqQTM4+F/pMZaBTOQeE8Suo71x9zCWoEbozQ+hcfCQbbLPzJ7sIGhH8MH8Vm6st2e9Pi9jEcUjE8v60CEZnXVQOg37bnmX0ifiOuHU1j0G1RUCIoCHNI= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=none 
dis=none) header.from=linux.beauty; spf=pass smtp.mailfrom=linux.beauty; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b=X4l7GTFS; arc=pass smtp.client-ip=136.143.188.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.beauty Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b="X4l7GTFS" ARC-Seal: i=1; a=rsa-sha256; t=1778489023; cv=none; d=zohomail.com; s=zohoarc; b=XBwMwB2/y9MdGOhfbArGkyaXszYIkCqlN9v/0F4PyrPi/z3jzSsKI1FoR1W/X3X3qMrmHLinQ5vEyRPrxH9kniL0DvoGd9z3cSO9FY3Ekwb0sbXAcnPRVqQt7PiecglsuR3M22TurXtJ5g51Uz+6HUvxtRv7gxoxJbTcTPwQLZM= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1778489023; h=Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:MIME-Version:Message-ID:References:Subject:Subject:To:To:Message-Id:Reply-To; bh=PbH7XFK/c7E9Ejy2BVWdbwpFXmuC5Rq+dtQ0ZQ6Xx18=; b=hB1mEcT/7FHNv4iqYZXkFrmbYfJxwvLdbLrzShxSv9Xio366qFSPsfj4q7MKcQvMjwIPUvCyl+lOnG+BZG9NOOwKY0wEGLCXf8wEM6Lj0NmDBiJI/bW7FgOg5VbOYu4gq93DZhqTqTc8ZdjUc2t3DCbc4/xHjqsHsFeA08mXxR4= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass header.i=linux.beauty; spf=pass smtp.mailfrom=me@linux.beauty; dmarc=pass header.from= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1778489022; s=zmail; d=linux.beauty; i=me@linux.beauty; h=From:From:To:To:Cc:Cc:Subject:Subject:Date:Date:Message-ID:In-Reply-To:References:MIME-Version:Content-Transfer-Encoding:Message-Id:Reply-To; bh=PbH7XFK/c7E9Ejy2BVWdbwpFXmuC5Rq+dtQ0ZQ6Xx18=; b=X4l7GTFSu9paPCzyEbm22I6++OFzB3/ef0hADouOGl5eQdPWIIpHZEUng1pM5ExS AWaGqHe8UrDHyrG84TFyj2JlfHk4Xq9DKbwl0Vl4UhM8viDjNLwlU8aNP0DJ5ktHnNE eoOpLTY/nWBLx2ldg+TdF0xM54UagXxyvXqItcY8= Received: by mx.zohomail.com with SMTPS id 1778489019274642.7961050691373; Mon, 11 May 2026 01:43:39 
-0700 (PDT)
From: Li Chen
To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara,
	Ojaswin Mujoo, "Ritesh Harjani (IBM)", Zhang Yi,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
	linux-trace-kernel@vger.kernel.org
Subject: [RFC v7 2/7] ext4: lockdep: handle i_data_sem subclassing for special inodes
Date: Mon, 11 May 2026 16:42:57 +0800
Message-ID: <20260511084304.1559557-3-me@linux.beauty>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260511084304.1559557-1-me@linux.beauty>
References: <20260511084304.1559557-1-me@linux.beauty>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-ZohoMailClient: External
Content-Type: text/plain; charset="utf-8"

Fast commit can hold s_fc_lock while writing journal blocks. Mapping the
journal inode can take its i_data_sem. Normal inode update paths can
take a data inode i_data_sem and then s_fc_lock, which makes lockdep
report a circular dependency.

lockdep treats all i_data_sem instances as one lock class and cannot
distinguish the journal inode i_data_sem from a regular inode
i_data_sem. The journal inode is not tracked by fast commit and no FC
waiters ever depend on it, so this is not a real ABBA deadlock.

Assign the journal inode a dedicated i_data_sem lockdep subclass to
avoid the false positive.

Inode cache objects can be recycled, so also reset i_data_sem to
I_DATA_SEM_NORMAL when allocating an ext4 inode. Otherwise a new inode
may inherit an old subclass (journal/quota/ea) and trigger lockdep
warnings.

Signed-off-by: Li Chen
---
Changes in v6:
- Rebase onto linux-next master as of 2026-04-08.
- Refresh the patch context around upstream ext4_alloc_inode() changes,
  without changing the subclassing logic.
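
The false positive described in the commit message can be modeled in
userspace. The sketch below is not kernel code and does not use the real
lockdep API: lock_key, dep_graph, record_dep(), has_abba() and
journal_report() are hypothetical names invented for illustration, and
the checker only looks for two-lock cycles. It keys each lock by
(class, subclass), which is the property the patch relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock classes for this sketch; not kernel identifiers. */
enum lock_cls { CLS_I_DATA_SEM, CLS_S_FC_LOCK };

/* Illustrative subclasses, loosely mirroring I_DATA_SEM_{NORMAL,JOURNAL}. */
enum lock_sub { SUB_NORMAL, SUB_JOURNAL };

struct lock_key {
	enum lock_cls cls;
	enum lock_sub sub;
};

/* One recorded ordering: "wanted" was taken while "held" was held. */
struct dep {
	struct lock_key held;
	struct lock_key wanted;
};

#define MAX_DEPS 16

struct dep_graph {
	struct dep deps[MAX_DEPS];
	size_t nr;
};

bool key_eq(struct lock_key a, struct lock_key b)
{
	return a.cls == b.cls && a.sub == b.sub;
}

void record_dep(struct dep_graph *g, struct lock_key held,
		struct lock_key wanted)
{
	assert(g->nr < MAX_DEPS);
	g->deps[g->nr].held = held;
	g->deps[g->nr].wanted = wanted;
	g->nr++;
}

/* True if both A -> B and B -> A were recorded (two-lock cycles only). */
bool has_abba(const struct dep_graph *g)
{
	for (size_t i = 0; i < g->nr; i++)
		for (size_t j = 0; j < g->nr; j++)
			if (key_eq(g->deps[i].held, g->deps[j].wanted) &&
			    key_eq(g->deps[i].wanted, g->deps[j].held))
				return true;
	return false;
}

/* Record the two orderings from the commit message and check them. */
bool journal_report(enum lock_sub journal_sub)
{
	struct dep_graph g = { .nr = 0 };
	struct lock_key fc = { CLS_S_FC_LOCK, SUB_NORMAL };
	struct lock_key data = { CLS_I_DATA_SEM, SUB_NORMAL };
	struct lock_key journal = { CLS_I_DATA_SEM, journal_sub };

	/* Fast commit: s_fc_lock held, then the journal inode i_data_sem. */
	record_dep(&g, fc, journal);
	/* Regular update: data inode i_data_sem held, then s_fc_lock. */
	record_dep(&g, data, fc);
	return has_abba(&g);
}
```

With the journal inode left in the normal subclass,
journal_report(SUB_NORMAL) reports a cycle because both i_data_sem
instances share one key. With a dedicated subclass,
journal_report(SUB_JOURNAL) reports none, since the two orderings no
longer refer to the same key.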

 fs/ext4/ext4.h  | 4 +++-
 fs/ext4/super.c | 8 ++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e01d00dbc077..05c8f67625b4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1015,12 +1015,14 @@ do { \
  *			  than the first
  * I_DATA_SEM_QUOTA - Used for quota inodes only
  * I_DATA_SEM_EA - Used for ea_inodes only
+ * I_DATA_SEM_JOURNAL - Used for journal inode only
  */
 enum {
 	I_DATA_SEM_NORMAL =3D 0,
 	I_DATA_SEM_OTHER,
 	I_DATA_SEM_QUOTA,
-	I_DATA_SEM_EA
+	I_DATA_SEM_EA,
+	I_DATA_SEM_JOURNAL
 };
=20
 struct ext4_fc_inode_snap;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6a77db4d3124..3c869f0001c5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1431,6 +1431,9 @@ static struct inode *ext4_alloc_inode(struct super_bl=
ock *sb)
 	ext4_fc_init_inode(&ei->vfs_inode);
 	spin_lock_init(&ei->i_fc_lock);
 	mmb_init(&ei->i_metadata_bhs, &ei->vfs_inode.i_data);
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&ei->i_data_sem, I_DATA_SEM_NORMAL);
+#endif
 	return &ei->vfs_inode;
 }
=20
@@ -5910,6 +5913,11 @@ static struct inode *ext4_get_journal_inode(struct s=
uper_block *sb,
 		return ERR_PTR(-EFSCORRUPTED);
 	}
=20
+#ifdef CONFIG_LOCKDEP
+	lockdep_set_subclass(&EXT4_I(journal_inode)->i_data_sem,
+			     I_DATA_SEM_JOURNAL);
+#endif
+
 	ext4_debug("Journal inode found at %p: %lld bytes\n",
 		   journal_inode, journal_inode->i_size);
 	return journal_inode;
--=20
2.53.0

From nobody Sat Jun 13 03:30:33 2026
Received: from sender4-op-o15.zoho.com (sender4-op-o15.zoho.com [136.143.188.15])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1B1232DC357;
	Mon, 11 May 2026 08:45:35 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=pass smtp.client-ip=136.143.188.15
ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489137; cv=pass;
b=MRDu7GVL6gq8gvATz2Sr7xDPNX8mezeBPu6+Ku5GbPGg85vls4hC+udTZnFJgkgfoHqj6oCAx8N5fnXYpTlr97MPBrAAWUU/SbmlLJJ3m9oFaqIgv971/Oe4dJ+AztdwGrJOKnwKJZTJh48UUbQ+Fan5TLxdjFaxlYHjoo7+D9g= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489137; c=relaxed/simple; bh=z0aqBXBkvEdgPC6uKc9Oosh2TMBsiabUNLS8ytAtNVA=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=PX4drabo1FdjVkEvI4cQ+H3b4T6391di8AYyvnUYDzVmnsIMqrFQCqmRSYS6YGzZaOFrEOUU4mBcH9/6OauGwrnCOxIHbtidDZL54+oVLmh6QIw/GdCEO+rBrVqI9auhAe845iB0iCHMUloFyw1T+0XVCbXJdE2pVmgsQrpUa2A= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty; spf=pass smtp.mailfrom=linux.beauty; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b=SexwZsNp; arc=pass smtp.client-ip=136.143.188.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.beauty Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b="SexwZsNp" ARC-Seal: i=1; a=rsa-sha256; t=1778489027; cv=none; d=zohomail.com; s=zohoarc; b=RfF22A+RC2TFmMkFbInjTS/V+5T4FwPbvqXua64JhXU8mnCPfdRb86v2sz+NHIWRHM5Y2ZWyKxer/s4hyFvJuJvHoYSJewHu9SOdGb3VYpWvCan0hPfr/Y2G2ljiVwdA97SvjhXQyRb0N7YwAOCAYa79uC17HkJPmZi5507iBrs= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1778489027; h=Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:MIME-Version:Message-ID:References:Subject:Subject:To:To:Message-Id:Reply-To; bh=pNL0p9xrlxkIMetCGECLWmYm+gVZqOGcNx9EPCOnSzc=; b=ZuVGLLV+5x8ExZdui2Sj629G7yDnDIwVcdlDIoa4uhO9N/6nzyxFQFVeSKWYwoxvXSRLaY2NGz0bFoQnJeGmYT/2wYUgJbk/S7Z+vM27mrqZn+o9lDWVl1vL00CcKrkNzhdfLzvXq3pk/GSGuTuofVA1wSppVkLVAXpSTqtucAk= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass 
header.i=linux.beauty; spf=pass smtp.mailfrom=me@linux.beauty; dmarc=pass header.from= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1778489027; s=zmail; d=linux.beauty; i=me@linux.beauty; h=From:From:To:To:Cc:Cc:Subject:Subject:Date:Date:Message-ID:In-Reply-To:References:MIME-Version:Content-Transfer-Encoding:Message-Id:Reply-To; bh=pNL0p9xrlxkIMetCGECLWmYm+gVZqOGcNx9EPCOnSzc=; b=SexwZsNpbXOz1BJg4onEAtV0Bh5TLHn++3EjVeD4CqaRbLj5PBPpr1jiCqnSAqXP NH3hJVXOfBKVZUiXTRJKxqz8xUubSnDMGXzB6DTTKLvToh2SLGy8CnZlH3wX8JY2zWg 0YhE8US5bfm9e1MptThoqPSS0n/TFnjaX/b3gsQ0= Received: by mx.zohomail.com with SMTPS id 1778489024006938.2404796329722; Mon, 11 May 2026 01:43:44 -0700 (PDT) From: Li Chen To: Zhang Yi , Theodore Ts'o , Andreas Dilger , Baokun Li , Jan Kara , Ojaswin Mujoo , "Ritesh Harjani (IBM)" , Zhang Yi , linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org Subject: [RFC v7 3/7] ext4: fast commit: avoid waiting for FC_COMMITTING Date: Mon, 11 May 2026 16:42:58 +0800 Message-ID: <20260511084304.1559557-4-me@linux.beauty> X-Mailer: git-send-email 2.53.0 In-Reply-To: <20260511084304.1559557-1-me@linux.beauty> References: <20260511084304.1559557-1-me@linux.beauty> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ZohoMailClient: External Content-Type: text/plain; charset="utf-8" ext4_fc_track_inode() can be called while holding i_data_sem (e.g. fallocate). Waiting for EXT4_STATE_FC_COMMITTING in that case risks an ABBA deadlock: i_data_sem -> wait(FC_COMMITTING) vs FC_COMMITTING -> wait(i_data_sem) in the commit task. Now that fast commit snapshots inode state at commit time, updates during log writing do not need to block. 
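A rough userspace illustration (not kernel code) of why the update path no longer has to block: the tracker only records that the inode changed while a commit was in flight, and cleanup decides from that mark whether to keep the inode queued. All names here (model_inode, model_commit_begin, model_track, model_cleanup, MODEL_*) are hypothetical and only mirror the idea of the COMMITTING and REQUEUE state bits; locking and wakeups are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in ext4. */
enum {
	MODEL_COMMITTING = 1u << 0,	/* a commit is using this inode */
	MODEL_REQUEUE    = 1u << 1,	/* inode changed during that commit */
};

struct model_inode {
	unsigned int state;	/* MODEL_* flags */
	bool queued;		/* on the (modelled) fast commit queue */
};

/* Commit start: the snapshot would be taken here, then the flag is set. */
static void model_commit_begin(struct model_inode *mi)
{
	assert(mi->queued);
	mi->state |= MODEL_COMMITTING;
}

/* Update path: never waits, only notes that the inode changed mid-commit. */
static void model_track(struct model_inode *mi)
{
	if (mi->state & MODEL_COMMITTING)
		mi->state |= MODEL_REQUEUE;
	mi->queued = true;
}

/* Cleanup: keep the inode queued only if it changed during the commit. */
static bool model_cleanup(struct model_inode *mi)
{
	bool requeue = (mi->state & MODEL_REQUEUE) != 0;

	mi->state &= ~(MODEL_COMMITTING | MODEL_REQUEUE);
	mi->queued = requeue;
	return requeue;
}
```

In this model an inode touched between model_commit_begin() and model_cleanup() stays queued for the next commit, while an untouched one is dropped from the queue.
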
Drop the wait and lockdep assertion in ext4_fc_track_inode(), and make ext4_fc_del() wait for FC_COMMITTING so an inode cannot be removed while the commit thread is still using it. When an inode is modified during a fast commit, mark it with EXT4_STATE_FC_REQUEUE so cleanup keeps it queued for the next fast commit. This is needed because jbd2_fc_end_commit() invokes the cleanup callback with tid =3D=3D 0, so tid-based requeue logic would requeue every inode. Testing: tracepoint ext4:ext4_fc_commit_stop with two fsyncs in the same transaction. nblks is the number of journal blocks written for that fast commit. Before this change, the second fsync still wrote almost the same fast commit log (nblks 10->9), because tid =3D=3D 0 in jbd2_fc_end_commit() caused the tid-based requeue logic to keep all inodes queued. After this change, only inodes modified during the commit are requeued, and the second fsync wrote a nearly empty fast commit (nblks 10->1). Signed-off-by: Li Chen --- fs/ext4/ext4.h | 1 + fs/ext4/fast_commit.c | 111 ++++++++++++++++++++---------------------- 2 files changed, 53 insertions(+), 59 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 05c8f67625b4..2a706acdfaf8 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1991,6 +1991,7 @@ enum { EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */ EXT4_STATE_FC_FLUSHING_DATA, /* Fast commit flushing data */ EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */ + EXT4_STATE_FC_REQUEUE, /* Inode modified during fast commit */ }; =20 #define EXT4_INODE_BIT_FNS(name, field, offset) \ diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c index cd4eac4e7dcb..273bf34031ae 100644 --- a/fs/ext4/fast_commit.c +++ b/fs/ext4/fast_commit.c @@ -62,9 +62,8 @@ * setting "EXT4_STATE_FC_COMMITTING" state, and snapshot the inode st= ate * needed for log writing. * [5] Unlock the journal by calling jbd2_journal_unlock_updates(). This a= llows - * starting of new handles. 
If new handles try to start an update on - * any of the inodes that are being committed, ext4_fc_track_inode() - * will block until those inodes have finished the fast commit. + * starting of new handles. Updates to inodes being fast committed are + * tracked for requeue rather than blocking. * [6] Commit all the directory entry updates in the fast commit space. * [7] Commit all the changed inodes in the fast commit space. * [8] Write tail tag (this tag ensures the atomicity, please read the fol= lowing @@ -218,6 +217,7 @@ void ext4_fc_init_inode(struct inode *inode) =20 ext4_fc_reset_inode(inode); ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING); + ext4_clear_inode_state(inode, EXT4_STATE_FC_REQUEUE); INIT_LIST_HEAD(&ei->i_fc_list); INIT_LIST_HEAD(&ei->i_fc_dilist); ei->i_fc_snap =3D NULL; @@ -257,22 +257,30 @@ void ext4_fc_del(struct inode *inode) } =20 /* - * Since ext4_fc_del is called from ext4_evict_inode while having a - * handle open, there is no need for us to wait here even if a fast - * commit is going on. That is because, if this inode is being - * committed, ext4_mark_inode_dirty would have waited for inode commit - * operation to finish before we come here. So, by the time we come - * here, inode's EXT4_STATE_FC_COMMITTING would have been cleared. So, - * we shouldn't see EXT4_STATE_FC_COMMITTING to be set on this inode - * here. - * - * We may come here without any handles open in the "no_delete" case of - * ext4_evict_inode as well. However, if that happens, we first mark the - * file system as fast commit ineligible anyway. So, even in that case, - * it is okay to remove the inode from the fc list. + * Wait for ongoing fast commit to finish. We cannot remove the inode + * from fast commit lists while it is being committed. 
*/ - WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING) - && !ext4_test_mount_flag(inode->i_sb, EXT4_MF_FC_INELIGIBLE)); + while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) { +#if (BITS_PER_LONG < 64) + DEFINE_WAIT_BIT(wait, &ei->i_state_flags, + EXT4_STATE_FC_COMMITTING); + wq =3D bit_waitqueue(&ei->i_state_flags, + EXT4_STATE_FC_COMMITTING); +#else + DEFINE_WAIT_BIT(wait, &ei->i_flags, + EXT4_STATE_FC_COMMITTING); + wq =3D bit_waitqueue(&ei->i_flags, + EXT4_STATE_FC_COMMITTING); +#endif + prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); + if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) { + ext4_fc_unlock(inode->i_sb, alloc_ctx); + schedule(); + alloc_ctx =3D ext4_fc_lock(inode->i_sb); + } + finish_wait(wq, &wait.wq_entry); + } + while (ext4_test_inode_state(inode, EXT4_STATE_FC_FLUSHING_DATA)) { #if (BITS_PER_LONG < 64) DEFINE_WAIT_BIT(wait, &ei->i_state_flags, @@ -293,19 +301,22 @@ void ext4_fc_del(struct inode *inode) } finish_wait(wq, &wait.wq_entry); } + ext4_fc_free_inode_snap(inode); list_del_init(&ei->i_fc_list); =20 /* - * Since this inode is getting removed, let's also remove all FC - * dentry create references, since it is not needed to log it anyways. + * Since this inode is getting removed, let's also remove all FC dentry + * create references, since it is not needed to log it anyways. 
*/ if (list_empty(&ei->i_fc_dilist)) { ext4_fc_unlock(inode->i_sb, alloc_ctx); return; } =20 - fc_dentry =3D list_first_entry(&ei->i_fc_dilist, struct ext4_fc_dentry_up= date, fcd_dilist); + fc_dentry =3D list_first_entry(&ei->i_fc_dilist, + struct ext4_fc_dentry_update, + fcd_dilist); WARN_ON(fc_dentry->fcd_op !=3D EXT4_FC_TAG_CREAT); list_del_init(&fc_dentry->fcd_list); list_del_init(&fc_dentry->fcd_dilist); @@ -377,6 +388,8 @@ static int ext4_fc_track_template( =20 tid =3D handle->h_transaction->t_tid; spin_lock(&ei->i_fc_lock); + if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) + ext4_set_inode_state(inode, EXT4_STATE_FC_REQUEUE); if (tid =3D=3D ei->i_sync_tid) { update =3D true; } else { @@ -547,8 +560,6 @@ static int __track_inode(handle_t *handle, struct inode= *inode, void *arg, =20 void ext4_fc_track_inode(handle_t *handle, struct inode *inode) { - struct ext4_inode_info *ei =3D EXT4_I(inode); - wait_queue_head_t *wq; int ret; =20 if (S_ISDIR(inode->i_mode)) @@ -564,29 +575,11 @@ void ext4_fc_track_inode(handle_t *handle, struct ino= de *inode) return; =20 /* - * If we come here, we may sleep while waiting for the inode to - * commit. We shouldn't be holding i_data_sem when we go to sleep since - * the commit path needs to grab the lock while committing the inode. + * Fast commit snapshots inode state at commit time, so there's no need + * to wait for EXT4_STATE_FC_COMMITTING here. If the inode is already + * on the commit queue, ext4_fc_cleanup() will requeue it for the new + * transaction once the current commit finishes. 
*/ - lockdep_assert_not_held(&ei->i_data_sem); - - while (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) { -#if (BITS_PER_LONG < 64) - DEFINE_WAIT_BIT(wait, &ei->i_state_flags, - EXT4_STATE_FC_COMMITTING); - wq =3D bit_waitqueue(&ei->i_state_flags, - EXT4_STATE_FC_COMMITTING); -#else - DEFINE_WAIT_BIT(wait, &ei->i_flags, - EXT4_STATE_FC_COMMITTING); - wq =3D bit_waitqueue(&ei->i_flags, - EXT4_STATE_FC_COMMITTING); -#endif - prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE); - if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) - schedule(); - finish_wait(wq, &wait.wq_entry); - } =20 /* * From this point on, this inode will not be committed either @@ -1510,32 +1503,32 @@ static void ext4_fc_cleanup(journal_t *journal, int= full, tid_t tid) =20 alloc_ctx =3D ext4_fc_lock(sb); while (!list_empty(&sbi->s_fc_q[FC_Q_MAIN])) { + bool requeue; + ei =3D list_first_entry(&sbi->s_fc_q[FC_Q_MAIN], struct ext4_inode_info, i_fc_list); list_del_init(&ei->i_fc_list); ext4_fc_free_inode_snap(&ei->vfs_inode); + spin_lock(&ei->i_fc_lock); + if (full) + requeue =3D !tid_geq(tid, ei->i_sync_tid); + else + requeue =3D ext4_test_inode_state(&ei->vfs_inode, + EXT4_STATE_FC_REQUEUE); + if (!requeue) + ext4_fc_reset_inode(&ei->vfs_inode); + ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_REQUEUE); ext4_clear_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING); - if (tid_geq(tid, ei->i_sync_tid)) { - ext4_fc_reset_inode(&ei->vfs_inode); - } else if (full) { - /* - * We are called after a full commit, inode has been - * modified while the commit was running. Re-enqueue - * the inode into STAGING, which will then be splice - * back into MAIN. This cannot happen during - * fastcommit because the journal is locked all the - * time in that case (and tid doesn't increase so - * tid check above isn't reliable). 
- */ + spin_unlock(&ei->i_fc_lock); + if (requeue) list_add_tail(&ei->i_fc_list, &sbi->s_fc_q[FC_Q_STAGING]); - } /* * Make sure clearing of EXT4_STATE_FC_COMMITTING is * visible before we send the wakeup. Pairs with implicit - * barrier in prepare_to_wait() in ext4_fc_track_inode(). + * barrier in prepare_to_wait() in ext4_fc_del(). */ smp_mb(); #if (BITS_PER_LONG < 64) --=20 2.53.0 From nobody Sat Jun 13 03:30:33 2026 Received: from sender4-op-o15.zoho.com (sender4-op-o15.zoho.com [136.143.188.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 563BE2DC357; Mon, 11 May 2026 08:46:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=pass smtp.client-ip=136.143.188.15 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489165; cv=pass; b=BQI+LpGZqApvQD/YCpGQmAfnj6UpwXF5qyQFgqPdszE6E5Eq04C3EcM0MNuY5XVnOKwWCNhdlrBk0THfyLLUFw+rBdfzlP38aOGkihUgxpgDCWjVuYFCwNadReiWvxLOL/T0samhjB9evOeAlVxQXWfnWM0aG20tli8IxUFtWV0= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489165; c=relaxed/simple; bh=3t4RVB/WVQnMJ9CP14QFMTycPdqOlWMmWgsBNjyLuDY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=XC1pHGGEqdEZsW/ta2LjIPgVkswkDKrYSwRLAbhOpAsPHnAV3oq8hj2hVNjI6VqCwv6UX4vbmz6LKPkM3XAvNzkAQY4wVE6FCtQOz9vX01E9ePv77x8BQb8NaxuYkrfVVWPCSWHSwOdKd/GW6ELG364rNXsL59dNbSTFM4ZZihM= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty; spf=pass smtp.mailfrom=linux.beauty; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b=oxJ0H9fg; arc=pass smtp.client-ip=136.143.188.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.beauty Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b="oxJ0H9fg" ARC-Seal: i=1; a=rsa-sha256; t=1778489033; cv=none; d=zohomail.com; s=zohoarc; b=Gx7lXXPBO2hCZmi3H31tM4HZetzM9AeCT2cfeO6pGIfh2dxcAJE8VqpERWKjZh4ulU13Ha4PLzN0h59cnv4k430sS2mcaLiGSwFud9fiiMS5/nYn+8rKD9zrXcrd1+PuCY53q6wjd6RY6h0NngzfoxrB+zhZXPzfjTD8Hm+LxVQ= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1778489033; h=Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:MIME-Version:Message-ID:References:Subject:Subject:To:To:Message-Id:Reply-To; bh=cB8mv5pgJ8VYeNK+vanExt3/tJtti6pjz35jqbs2VNY=; b=YHA1iGK27ctICWpw6JIQIlftfhCCC6PO0Yx9NVSJBIXnWJOmO3p6pqudwAhs0OE1YE3F5o/xPwOOwS5L7U5nk4LiLCjkDt0FRWasvG6Cu34oXcRmAIZ5hVnet5m/Kz/25M+oBQmPfkZhIkWAHhyNSx2FrJbTDE9A4D3Qxjf/4vM= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass header.i=linux.beauty; spf=pass smtp.mailfrom=me@linux.beauty; dmarc=pass header.from= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1778489033; s=zmail; d=linux.beauty; i=me@linux.beauty; h=From:From:To:To:Cc:Cc:Subject:Subject:Date:Date:Message-ID:In-Reply-To:References:MIME-Version:Content-Transfer-Encoding:Message-Id:Reply-To; bh=cB8mv5pgJ8VYeNK+vanExt3/tJtti6pjz35jqbs2VNY=; b=oxJ0H9fgGDTpBKdGB1Nnep8mEOAr8Kor+EKtm9nlEyxZuSqeUlp76QbKmW2hlLf5 n0/VBb03A1tMP0JM93BQhQ2BDq4Ss5YuhWO2jHHCv3rJjYApuiI1V3oPqwA6ChDuOMO TaVpVhtuW1g2W/Y+43QIcXpR9l2MO26SSV882xDg= Received: by mx.zohomail.com with SMTPS id 177848902915965.24901118520575; Mon, 11 May 2026 01:43:49 -0700 (PDT) From: Li Chen To: Zhang Yi , Theodore Ts'o , Andreas Dilger , Baokun Li , Jan Kara , Ojaswin Mujoo , "Ritesh Harjani (IBM)" , Zhang Yi , linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org Subject: [RFC v7 4/7] ext4: fast commit: avoid self-deadlock in inode snapshotting Date: Mon, 11 May 2026 16:42:59 
+0800 Message-ID: <20260511084304.1559557-5-me@linux.beauty> X-Mailer: git-send-email 2.53.0 In-Reply-To: <20260511084304.1559557-1-me@linux.beauty> References: <20260511084304.1559557-1-me@linux.beauty> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ZohoMailClient: External Content-Type: text/plain; charset="utf-8" ext4_fc_snapshot_inodes() used igrab()/iput() to pin inodes while building commit-time snapshots. With ext4_fc_del() waiting for EXT4_STATE_FC_COMMITTING, iput() can trigger ext4_clear_inode()->ext4_fc_del() in the commit thread and deadlock waiting for the fast commit to finish. Avoid taking extra references. Collect inode pointers under s_fc_lock and rely on EXT4_STATE_FC_COMMITTING to pin inodes until ext4_fc_cleanup() clears the bit. Also set EXT4_STATE_FC_COMMITTING for create-only inodes referenced from the dentry update queue, and wake up waiters when ext4_fc_cleanup() clears the bit. Signed-off-by: Li Chen --- fs/ext4/fast_commit.c | 47 ++++++++++++++++++++++++++++++++----------- 1 file changed, 35 insertions(+), 12 deletions(-) diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c index 273bf34031ae..f9bb18c0b549 100644 --- a/fs/ext4/fast_commit.c +++ b/fs/ext4/fast_commit.c @@ -1195,13 +1195,12 @@ static int ext4_fc_snapshot_inodes(journal_t *journ= al) =20 alloc_ctx =3D ext4_fc_lock(sb); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { - inodes[i] =3D igrab(&iter->vfs_inode); - if (inodes[i]) - i++; + inodes[i++] =3D &iter->vfs_inode; } =20 list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) { struct ext4_inode_info *ei; + struct inode *inode; =20 if (fc_dentry->fcd_op !=3D EXT4_FC_TAG_CREAT) continue; @@ -1211,12 +1210,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journ= al) /* See the comment in ext4_fc_commit_dentry_updates(). 
*/ ei =3D list_first_entry(&fc_dentry->fcd_dilist, struct ext4_inode_info, i_fc_dilist); + inode =3D &ei->vfs_inode; if (!list_empty(&ei->i_fc_list)) continue; =20 - inodes[i] =3D igrab(&ei->vfs_inode); - if (inodes[i]) - i++; + /* + * Create-only inodes may only be referenced via fcd_dilist and + * not appear on s_fc_q[MAIN]. They may hit the last iput while + * we are snapshotting, but inode eviction calls ext4_fc_del(), + * which waits for FC_COMMITTING to clear. Mark them FC_COMMITTING + * so the inode stays pinned and the snapshot stays valid until + * ext4_fc_cleanup(). + */ + ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING); + inodes[i++] =3D inode; } ext4_fc_unlock(sb, alloc_ctx); =20 @@ -1226,10 +1233,6 @@ static int ext4_fc_snapshot_inodes(journal_t *journa= l) break; } =20 - for (nr_inodes =3D 0; nr_inodes < i; nr_inodes++) { - if (inodes[nr_inodes]) - iput(inodes[nr_inodes]); - } kvfree(inodes); return ret; } @@ -1297,8 +1300,9 @@ static int ext4_fc_perform_commit(journal_t *journal) jbd2_journal_lock_updates(journal); /* * The journal is now locked. No more handles can start and all the - * previous handles are now drained. We now mark the inodes on the - * commit queue as being committed. + * previous handles are now drained. Snapshotting happens in this + * window so log writing can consume only stable snapshots without + * doing logical-to-physical mapping. */ alloc_ctx =3D ext4_fc_lock(sb); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { @@ -1550,6 +1554,25 @@ static void ext4_fc_cleanup(journal_t *journal, int = full, tid_t tid) struct ext4_inode_info, i_fc_dilist); ext4_fc_free_inode_snap(&ei->vfs_inode); + spin_lock(&ei->i_fc_lock); + ext4_clear_inode_state(&ei->vfs_inode, + EXT4_STATE_FC_REQUEUE); + ext4_clear_inode_state(&ei->vfs_inode, + EXT4_STATE_FC_COMMITTING); + spin_unlock(&ei->i_fc_lock); + /* + * Make sure clearing of EXT4_STATE_FC_COMMITTING is + * visible before we send the wakeup. 
Pairs with implicit + * barrier in prepare_to_wait() in ext4_fc_del(). + */ + smp_mb(); +#if (BITS_PER_LONG < 64) + wake_up_bit(&ei->i_state_flags, + EXT4_STATE_FC_COMMITTING); +#else + wake_up_bit(&ei->i_flags, + EXT4_STATE_FC_COMMITTING); +#endif } list_del_init(&fc_dentry->fcd_dilist); =20 --=20 2.53.0 From nobody Sat Jun 13 03:30:33 2026 Received: from sender4-op-o15.zoho.com (sender4-op-o15.zoho.com [136.143.188.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 39B242DC357; Mon, 11 May 2026 08:46:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=pass smtp.client-ip=136.143.188.15 ARC-Seal: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489193; cv=pass; b=KhZ3QsbyR0yjMj95YTaw34M1MCAonUlctO88hkTkNbUF6qBrtICoGyFVlXIYdU2sc7hktgEHr6lydQ87PW1VO11huPq4vK2MvgpGrlqRlvdW6jVoq/N3jG9UiwLXggoULi9lHQVuDwFpxW6BWNe7PX0jOoJ6r1tIxsSqIbqJttM= ARC-Message-Signature: i=2; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778489193; c=relaxed/simple; bh=MrrEwY7AIm58bXp/ZX+9E9gYgtVD8lGh0GVDDZrj1mA=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=oTOwanMIlHkEGbntAZjeIRl3/w9msTafS53sbMT1qFmQiwlFI3lvnPyKgyh56CmBDwJ76DmAD9IfZr18qvpYTcp4cT9EGa7DJBEXcrELJujqH4PT84/M07we4TjxijQyz0VOYA6g1VyDJvYTOIOSiiJWCcqguQ/aweuELYq3QOM= ARC-Authentication-Results: i=2; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty; spf=pass smtp.mailfrom=linux.beauty; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty header.b=Ac/rHAMa; arc=pass smtp.client-ip=136.143.188.15 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.beauty Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.beauty Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.beauty header.i=me@linux.beauty 
header.b="Ac/rHAMa" ARC-Seal: i=1; a=rsa-sha256; t=1778489037; cv=none; d=zohomail.com; s=zohoarc; b=j6mSpy4nPMAJNbYHdSzGhB/GlrqM4ZGoQSXvQeEpf3xRvldUHu5kHEzidwDwrm0+i8zcqVoFQXunI5Gc4e3EqoPUULH82rTtVlbuHNQOdTsWCH9vDSi4etPDPK9hgLShQfkLh+OmuXIY3g0gOiHP+42n5F+Q5z1Z4CzC7GNo9UA= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=zohomail.com; s=zohoarc; t=1778489037; h=Content-Transfer-Encoding:Cc:Cc:Date:Date:From:From:In-Reply-To:MIME-Version:Message-ID:References:Subject:Subject:To:To:Message-Id:Reply-To; bh=Y8Xxt3LfBC8k6KeKouRjskSmFCKYqYaicKxDFCHJmTM=; b=Q3sFVmA0427YKgNMDmLof3HsNGMRtMW0kntYhyTww7TX48ZOEuUXrjJMVBpeGZTTXdwpzVlFD7/NyLkUUlPPrFiwDtds2a/d8k4Td+nzlif7XEh4KCUxBwchiHVPsUNc/AKxyn5ca+zVr07PWeN7YWMhWZxEXEeydYGROtbu2NM= ARC-Authentication-Results: i=1; mx.zohomail.com; dkim=pass header.i=linux.beauty; spf=pass smtp.mailfrom=me@linux.beauty; dmarc=pass header.from= DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; t=1778489037; s=zmail; d=linux.beauty; i=me@linux.beauty; h=From:From:To:To:Cc:Cc:Subject:Subject:Date:Date:Message-ID:In-Reply-To:References:MIME-Version:Content-Transfer-Encoding:Message-Id:Reply-To; bh=Y8Xxt3LfBC8k6KeKouRjskSmFCKYqYaicKxDFCHJmTM=; b=Ac/rHAMaarFU3v7M5T0JcaLQXFIZL/7iHV8Fuy12TsGdR1glNyLirMCuyVxvPVmn 34jdHjPdyhyP7lmzjt8NS5hkBpSoKZhskKcaQqUxq5elQy3bQa5HDVqGDQp1eAdqqXk WUpMJf1/wtaOKygDlA/0Po5JC6uqMammho+xLmis= Received: by mx.zohomail.com with SMTPS id 1778489034861779.1047126173121; Mon, 11 May 2026 01:43:54 -0700 (PDT) From: Li Chen To: Zhang Yi , Theodore Ts'o , Andreas Dilger , Baokun Li , Jan Kara , Ojaswin Mujoo , "Ritesh Harjani (IBM)" , Zhang Yi , linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org Cc: Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , linux-trace-kernel@vger.kernel.org Subject: [RFC v7 5/7] ext4: fast commit: avoid i_data_sem by dropping ext4_map_blocks() in snapshots Date: Mon, 11 May 2026 16:43:00 +0800 Message-ID: <20260511084304.1559557-6-me@linux.beauty> X-Mailer: 
git-send-email 2.53.0 In-Reply-To: <20260511084304.1559557-1-me@linux.beauty> References: <20260511084304.1559557-1-me@linux.beauty> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ZohoMailClient: External Content-Type: text/plain; charset="utf-8" Commit-time snapshots run under jbd2_journal_lock_updates(), so the work done there must stay bounded. The snapshot path still used ext4_map_blocks() to build data ranges. This can take i_data_sem and pulls the mapping code into the snapshot logic. Build inode data range snapshots from the extent status tree instead. The extent status tree is a cache, not an authoritative source. If the needed information is missing or unstable (e.g. delayed allocation), treat the transaction as fast commit ineligible and fall back to full commit. Also cap the number of inodes and ranges snapshotted per fast commit and allocate range records from a dedicated slab cache. The inode pointer array is allocated outside the updates-locked window. Testing: QEMU/KVM guest, virtio-pmem + dax, ext4 -O fast_commit, mounted dax,noatime. Ran python3 500x {4K write + fsync}, fallocate 256M, and python3 500x {creat + fsync(dir)} without lockdep splats or errors. Signed-off-by: Li Chen --- Changes in v7: - Address Sashiko review by guarding snapshot range arithmetic near EXT_MAX_BLOCKS to avoid cur_lblk / remaining-range wraparound in the snapshot walk. 
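For reviewers, the wraparound guard mentioned above can be sketched in userspace as follows. This is only a model, assuming a 32-bit logical block number (as ext4_lblk_t is); walk_chunks() and its chunk parameter are hypothetical names, and the real walk takes its lengths from the extent status tree rather than a fixed chunk size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the guarded walk; not an ext4 function.
 * Visits [start, end] inclusive in pieces of at most `chunk` blocks and
 * returns the number of pieces visited.
 */
static unsigned int walk_chunks(uint32_t start, uint32_t end, uint32_t chunk)
{
	uint32_t cur = start;
	unsigned int pieces = 0;

	assert(chunk > 0 && start <= end);

	while (cur <= end) {
		/* Computed in 64 bits so end - cur + 1 cannot wrap to 0. */
		uint64_t remaining = (uint64_t)end - cur + 1;
		uint32_t len = chunk;

		if (len > remaining)
			len = (uint32_t)remaining;
		pieces++;

		/* Stop before cur += len could step past end and wrap. */
		if ((uint64_t)len > (uint64_t)end - cur)
			break;
		cur += len;
	}
	return pieces;
}
```

Without the final check, a range ending at the maximum 32-bit block number would let cur wrap back to a small value and the loop condition cur <= end would stay true.
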
fs/ext4/fast_commit.c | 257 +++++++++++++++++++++++++++++------------- 1 file changed, 181 insertions(+), 76 deletions(-) diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c index f9bb18c0b549..9fc17c1fa7af 100644 --- a/fs/ext4/fast_commit.c +++ b/fs/ext4/fast_commit.c @@ -184,6 +184,15 @@ =20 #include static struct kmem_cache *ext4_fc_dentry_cachep; +static struct kmem_cache *ext4_fc_range_cachep; + +/* + * Avoid spending unbounded time/memory snapshotting highly fragmented fil= es + * under jbd2_journal_lock_updates(). If we exceed this limit, fall back to + * full commit. + */ +#define EXT4_FC_SNAPSHOT_MAX_INODES 1024 +#define EXT4_FC_SNAPSHOT_MAX_RANGES 2048 =20 static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate) { @@ -938,7 +947,7 @@ static void ext4_fc_free_ranges(struct list_head *head) =20 list_for_each_entry_safe(range, range_n, head, list) { list_del(&range->list); - kfree(range); + kmem_cache_free(ext4_fc_range_cachep, range); } } =20 @@ -956,16 +965,19 @@ static void ext4_fc_free_inode_snap(struct inode *ino= de) } =20 static int ext4_fc_snapshot_inode_data(struct inode *inode, - struct list_head *ranges) + struct list_head *ranges, + unsigned int nr_ranges_total, + unsigned int *nr_rangesp) { struct ext4_inode_info *ei =3D EXT4_I(inode); + unsigned int nr_ranges =3D 0; ext4_lblk_t start_lblk, end_lblk, cur_lblk; - struct ext4_map_blocks map; - int ret; =20 spin_lock(&ei->i_fc_lock); if (ei->i_fc_lblk_len =3D=3D 0) { spin_unlock(&ei->i_fc_lock); + if (nr_rangesp) + *nr_rangesp =3D 0; return 0; } start_lblk =3D ei->i_fc_lblk_start; @@ -979,61 +991,82 @@ static int ext4_fc_snapshot_inode_data(struct inode *= inode, (unsigned long long)inode->i_ino); =20 while (cur_lblk <=3D end_lblk) { + struct extent_status es; struct ext4_fc_range *range; + ext4_lblk_t len; + u64 remaining =3D (u64)end_lblk - cur_lblk + 1; =20 - map.m_lblk =3D cur_lblk; - map.m_len =3D end_lblk - cur_lblk + 1; - ret =3D ext4_map_blocks(NULL, inode, &map, - 
EXT4_GET_BLOCKS_IO_SUBMIT | - EXT4_EX_NOCACHE); - if (ret < 0) - return -ECANCELED; + if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) + return -EAGAIN; + + if (ext4_es_is_delayed(&es)) + return -EAGAIN; =20 - if (map.m_len =3D=3D 0) { + len =3D es.es_len - (cur_lblk - es.es_lblk); + if (len > remaining) + len =3D remaining; + if (len =3D=3D 0) { cur_lblk++; continue; } =20 - range =3D kmalloc(sizeof(*range), GFP_NOFS); + if (nr_ranges_total + nr_ranges >=3D EXT4_FC_SNAPSHOT_MAX_RANGES) + return -E2BIG; + + range =3D kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS); if (!range) return -ENOMEM; + nr_ranges++; =20 - range->lblk =3D map.m_lblk; - range->len =3D map.m_len; + range->lblk =3D cur_lblk; + range->len =3D len; range->pblk =3D 0; range->unwritten =3D false; =20 - if (ret =3D=3D 0) { + if (ext4_es_is_hole(&es)) { range->tag =3D EXT4_FC_TAG_DEL_RANGE; - } else { - unsigned int max =3D (map.m_flags & EXT4_MAP_UNWRITTEN) ? - EXT_UNWRITTEN_MAX_LEN : EXT_INIT_MAX_LEN; - - /* Limit the number of blocks in one extent */ - map.m_len =3D min(max, map.m_len); + } else if (ext4_es_is_written(&es) || + ext4_es_is_unwritten(&es)) { + unsigned int max; =20 range->tag =3D EXT4_FC_TAG_ADD_RANGE; - range->len =3D map.m_len; - range->pblk =3D map.m_pblk; - range->unwritten =3D !!(map.m_flags & EXT4_MAP_UNWRITTEN); + range->pblk =3D ext4_es_pblock(&es) + + (cur_lblk - es.es_lblk); + range->unwritten =3D ext4_es_is_unwritten(&es); + + max =3D range->unwritten ? 
EXT_UNWRITTEN_MAX_LEN : + EXT_INIT_MAX_LEN; + if (range->len > max) + range->len =3D max; + } else { + kmem_cache_free(ext4_fc_range_cachep, range); + return -EAGAIN; } =20 INIT_LIST_HEAD(&range->list); list_add_tail(&range->list, ranges); =20 - cur_lblk +=3D map.m_len; + if ((u64)range->len > (u64)end_lblk - cur_lblk) + break; + + cur_lblk +=3D range->len; } =20 + if (nr_rangesp) + *nr_rangesp =3D nr_ranges; return 0; } =20 -static int ext4_fc_snapshot_inode(struct inode *inode) +static int ext4_fc_snapshot_inode(struct inode *inode, + unsigned int nr_ranges_total, + unsigned int *nr_rangesp) { struct ext4_inode_info *ei =3D EXT4_I(inode); struct ext4_fc_inode_snap *snap; int inode_len =3D EXT4_GOOD_OLD_INODE_SIZE; struct ext4_iloc iloc; LIST_HEAD(ranges); + unsigned int nr_ranges =3D 0; int ret; int alloc_ctx; =20 @@ -1057,7 +1090,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode) memcpy(snap->inode_buf, (u8 *)ext4_raw_inode(&iloc), inode_len); brelse(iloc.bh); =20 - ret =3D ext4_fc_snapshot_inode_data(inode, &ranges); + ret =3D ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total, + &nr_ranges); if (ret) { kfree(snap); ext4_fc_free_ranges(&ranges); @@ -1070,10 +1104,11 @@ static int ext4_fc_snapshot_inode(struct inode *ino= de) list_splice_tail_init(&ranges, &snap->data_list); ext4_fc_unlock(inode->i_sb, alloc_ctx); =20 + if (nr_rangesp) + *nr_rangesp =3D nr_ranges; return 0; } =20 - /* Flushes data of all the inodes in the commit queue. 
*/ static int ext4_fc_flush_data(journal_t *journal) { @@ -1152,49 +1187,32 @@ static int ext4_fc_commit_dentry_updates(journal_t = *journal, u32 *crc) return 0; } =20 -static int ext4_fc_snapshot_inodes(journal_t *journal) +static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb, + struct inode ***inodesp, + unsigned int *nr_inodesp); + +static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inod= es, + unsigned int inodes_size) { struct super_block *sb =3D journal->j_private; struct ext4_sb_info *sbi =3D EXT4_SB(sb); struct ext4_inode_info *iter; struct ext4_fc_dentry_update *fc_dentry; - struct inode **inodes; - unsigned int nr_inodes =3D 0; unsigned int i =3D 0; + unsigned int idx; + unsigned int nr_ranges =3D 0; int ret =3D 0; int alloc_ctx; =20 - alloc_ctx =3D ext4_fc_lock(sb); - list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) - nr_inodes++; - - list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) { - struct ext4_inode_info *ei; - - if (fc_dentry->fcd_op !=3D EXT4_FC_TAG_CREAT) - continue; - if (list_empty(&fc_dentry->fcd_dilist)) - continue; - - /* See the comment in ext4_fc_commit_dentry_updates(). 
*/ - ei =3D list_first_entry(&fc_dentry->fcd_dilist, - struct ext4_inode_info, i_fc_dilist); - if (!list_empty(&ei->i_fc_list)) - continue; - - nr_inodes++; - } - ext4_fc_unlock(sb, alloc_ctx); - - if (!nr_inodes) + if (!inodes_size) return 0; =20 - inodes =3D kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS); - if (!inodes) - return -ENOMEM; - alloc_ctx =3D ext4_fc_lock(sb); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { + if (i >=3D inodes_size) { + ret =3D -E2BIG; + goto unlock; + } inodes[i++] =3D &iter->vfs_inode; } =20 @@ -1214,6 +1232,10 @@ static int ext4_fc_snapshot_inodes(journal_t *journa= l) if (!list_empty(&ei->i_fc_list)) continue; =20 + if (i >=3D inodes_size) { + ret =3D -E2BIG; + goto unlock; + } /* * Create-only inodes may only be referenced via fcd_dilist and * not appear on s_fc_q[MAIN]. They may hit the last iput while @@ -1225,15 +1247,22 @@ static int ext4_fc_snapshot_inodes(journal_t *journ= al) ext4_set_inode_state(inode, EXT4_STATE_FC_COMMITTING); inodes[i++] =3D inode; } +unlock: ext4_fc_unlock(sb, alloc_ctx); =20 - for (nr_inodes =3D 0; nr_inodes < i; nr_inodes++) { - ret =3D ext4_fc_snapshot_inode(inodes[nr_inodes]); + if (ret) + return ret; + + for (idx =3D 0; idx < i; idx++) { + unsigned int inode_ranges =3D 0; + + ret =3D ext4_fc_snapshot_inode(inodes[idx], nr_ranges, + &inode_ranges); if (ret) break; + nr_ranges +=3D inode_ranges; } =20 - kvfree(inodes); return ret; } =20 @@ -1244,6 +1273,8 @@ static int ext4_fc_perform_commit(journal_t *journal) struct ext4_inode_info *iter; struct ext4_fc_head head; struct inode *inode; + struct inode **inodes; + unsigned int inodes_size; struct blk_plug plug; int ret =3D 0; u32 crc =3D 0; @@ -1296,6 +1327,10 @@ static int ext4_fc_perform_commit(journal_t *journal) return ret; =20 =20 + ret =3D ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size); + if (ret) + return ret; + /* Step 4: Mark all inodes as being committed. 
*/ jbd2_journal_lock_updates(journal); /* @@ -1311,8 +1346,9 @@ static int ext4_fc_perform_commit(journal_t *journal) } ext4_fc_unlock(sb, alloc_ctx); =20 - ret =3D ext4_fc_snapshot_inodes(journal); + ret =3D ext4_fc_snapshot_inodes(journal, inodes, inodes_size); jbd2_journal_unlock_updates(journal); + kvfree(inodes); if (ret) return ret; =20 @@ -1368,6 +1404,64 @@ static int ext4_fc_perform_commit(journal_t *journal) return ret; } =20 +static unsigned int ext4_fc_count_snapshot_inodes(struct super_block *sb) +{ + struct ext4_sb_info *sbi =3D EXT4_SB(sb); + struct ext4_inode_info *iter; + struct ext4_fc_dentry_update *fc_dentry; + unsigned int nr_inodes =3D 0; + int alloc_ctx; + + alloc_ctx =3D ext4_fc_lock(sb); + list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) + nr_inodes++; + + list_for_each_entry(fc_dentry, &sbi->s_fc_dentry_q[FC_Q_MAIN], fcd_list) { + struct ext4_inode_info *ei; + + if (fc_dentry->fcd_op !=3D EXT4_FC_TAG_CREAT) + continue; + if (list_empty(&fc_dentry->fcd_dilist)) + continue; + + /* See the comment in ext4_fc_commit_dentry_updates(). 
*/ + ei =3D list_first_entry(&fc_dentry->fcd_dilist, + struct ext4_inode_info, i_fc_dilist); + if (!list_empty(&ei->i_fc_list)) + continue; + + nr_inodes++; + } + ext4_fc_unlock(sb, alloc_ctx); + + return nr_inodes; +} + +static int ext4_fc_alloc_snapshot_inodes(struct super_block *sb, + struct inode ***inodesp, + unsigned int *nr_inodesp) +{ + unsigned int nr_inodes =3D ext4_fc_count_snapshot_inodes(sb); + struct inode **inodes; + + *inodesp =3D NULL; + *nr_inodesp =3D 0; + + if (!nr_inodes) + return 0; + + if (nr_inodes > EXT4_FC_SNAPSHOT_MAX_INODES) + return -E2BIG; + + inodes =3D kvcalloc(nr_inodes, sizeof(*inodes), GFP_NOFS); + if (!inodes) + return -ENOMEM; + + *inodesp =3D inodes; + *nr_inodesp =3D nr_inodes; + return 0; +} + static void ext4_fc_update_stats(struct super_block *sb, int status, u64 commit_time, int nblks, tid_t commit_tid) { @@ -1460,7 +1554,10 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_= tid) fc_bufs_before =3D (sbi->s_fc_bytes + bsize - 1) / bsize; ret =3D ext4_fc_perform_commit(journal); if (ret < 0) { - status =3D EXT4_FC_STATUS_FAILED; + if (ret =3D=3D -EAGAIN || ret =3D=3D -E2BIG || ret =3D=3D -ECANCELED) + status =3D EXT4_FC_STATUS_INELIGIBLE; + else + status =3D EXT4_FC_STATUS_FAILED; goto fallback; } nblks =3D (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before; @@ -1544,34 +1641,35 @@ static void ext4_fc_cleanup(journal_t *journal, int= full, tid_t tid) =20 while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) { fc_dentry =3D list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN], - struct ext4_fc_dentry_update, - fcd_list); + struct ext4_fc_dentry_update, + fcd_list); list_del_init(&fc_dentry->fcd_list); if (fc_dentry->fcd_op =3D=3D EXT4_FC_TAG_CREAT && - !list_empty(&fc_dentry->fcd_dilist)) { + !list_empty(&fc_dentry->fcd_dilist)) { /* See the comment in ext4_fc_commit_dentry_updates(). 
*/ ei =3D list_first_entry(&fc_dentry->fcd_dilist, - struct ext4_inode_info, - i_fc_dilist); + struct ext4_inode_info, + i_fc_dilist); ext4_fc_free_inode_snap(&ei->vfs_inode); spin_lock(&ei->i_fc_lock); ext4_clear_inode_state(&ei->vfs_inode, - EXT4_STATE_FC_REQUEUE); + EXT4_STATE_FC_REQUEUE); ext4_clear_inode_state(&ei->vfs_inode, - EXT4_STATE_FC_COMMITTING); + EXT4_STATE_FC_COMMITTING); spin_unlock(&ei->i_fc_lock); /* * Make sure clearing of EXT4_STATE_FC_COMMITTING is - * visible before we send the wakeup. Pairs with implicit - * barrier in prepare_to_wait() in ext4_fc_del(). + * visible before we send the wakeup. Pairs with + * implicit barrier in prepare_to_wait() in + * ext4_fc_del(). */ smp_mb(); #if (BITS_PER_LONG < 64) wake_up_bit(&ei->i_state_flags, - EXT4_STATE_FC_COMMITTING); + EXT4_STATE_FC_COMMITTING); #else wake_up_bit(&ei->i_flags, - EXT4_STATE_FC_COMMITTING); + EXT4_STATE_FC_COMMITTING); #endif } list_del_init(&fc_dentry->fcd_dilist); @@ -2548,13 +2646,20 @@ int __init ext4_fc_init_dentry_cache(void) ext4_fc_dentry_cachep =3D KMEM_CACHE(ext4_fc_dentry_update, SLAB_RECLAIM_ACCOUNT); =20 - if (ext4_fc_dentry_cachep =3D=3D NULL) + if (!ext4_fc_dentry_cachep) return -ENOMEM; =20 + ext4_fc_range_cachep =3D KMEM_CACHE(ext4_fc_range, SLAB_RECLAIM_ACCOUNT); + if (!ext4_fc_range_cachep) { + kmem_cache_destroy(ext4_fc_dentry_cachep); + return -ENOMEM; + } + return 0; } =20 void ext4_fc_destroy_dentry_cache(void) { + kmem_cache_destroy(ext4_fc_range_cachep); kmem_cache_destroy(ext4_fc_dentry_cachep); } --=20 2.53.0
From: Li Chen
To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara, Ojaswin Mujoo, "Ritesh Harjani (IBM)", Zhang Yi, Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Subject: [RFC v7 6/7] ext4: fast commit: add lock_updates tracepoint
Date: Mon, 11 May 2026 16:43:01 +0800
Message-ID: <20260511084304.1559557-7-me@linux.beauty>
In-Reply-To: <20260511084304.1559557-1-me@linux.beauty>
References: <20260511084304.1559557-1-me@linux.beauty>

Commit-time fast commit snapshots run under jbd2_journal_lock_updates(),
so it is useful to quantify the time spent with updates locked and to
understand why snapshotting can fail.

Add a new tracepoint, ext4_fc_lock_updates, reporting the time spent in
the updates-locked window along with the number of snapshotted inodes
and ranges. Record the first snapshot failure reason in a stable
snap_err field for tooling.
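The reporting semantics described above can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel code: `struct snap_result`, `snapshot_all()`, the cut-down `enum snap_err`, and the `-1` failure value are hypothetical names and placeholders chosen for illustration. Only the first-failure-wins helper follows the shape of ext4_fc_set_snap_err() from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened stand-in for enum ext4_fc_snap_err. */
enum snap_err {
	SNAP_ERR_NONE = 0,
	SNAP_ERR_ES_MISS,
	SNAP_ERR_NOMEM,
};

/* Hypothetical bundle of the values the tracepoint would report. */
struct snap_result {
	unsigned int nr_done;	/* items snapshotted before stopping */
	int ret;		/* 0 on success, -1 (placeholder) on failure */
	int snap_err;		/* first failure reason seen */
};

/* Record a reason only if none has been recorded yet. */
static void set_snap_err(int *snap_err, int err)
{
	if (snap_err && *snap_err == SNAP_ERR_NONE)
		*snap_err = err;
}

/*
 * Walk the items in order. items[i] holds SNAP_ERR_NONE if item i can be
 * snapshotted, or the reason it cannot. Stop at the first failure and
 * still report how many items succeeded before it.
 */
static struct snap_result snapshot_all(const int *items, size_t n)
{
	struct snap_result res = { 0, 0, SNAP_ERR_NONE };
	size_t i;

	assert(items || n == 0);

	for (i = 0; i < n; i++) {
		if (items[i] != SNAP_ERR_NONE) {
			set_snap_err(&res.snap_err, items[i]);
			res.ret = -1;
			break;
		}
	}
	res.nr_done = (unsigned int)i;
	return res;
}
```

With inputs `{NONE, ES_MISS, NOMEM}` the sketch stops at the second item, reports one completed snapshot, and keeps ES_MISS as the reason, since later reasons never overwrite the first.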
Signed-off-by: Li Chen Reviewed-by: Steven Rostedt (Google) --- Changes in v7: - Address Sashiko review by reporting successfully snapshotted inode counts in ext4_fc_lock_updates when snapshotting stops early. Changes in v6: - Drop explicit ext4_fc_snap_err assignments and rely on enum auto-increment. - Treat locked_ns as trace-only in this patch and calculate it only when ext4_fc_lock_updates is enabled, as suggested by Steven Rostedt. fs/ext4/ext4.h | 15 ++++++++ fs/ext4/fast_commit.c | 74 +++++++++++++++++++++++++++++-------- include/trace/events/ext4.h | 61 ++++++++++++++++++++++++++++++ 3 files changed, 135 insertions(+), 15 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 2a706acdfaf8..df30f8705c98 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1027,6 +1027,21 @@ enum { =20 struct ext4_fc_inode_snap; =20 +/* + * Snapshot failure reasons for ext4_fc_lock_updates tracepoint. + * Keep these stable for tooling. + */ +enum ext4_fc_snap_err { + EXT4_FC_SNAP_ERR_NONE =3D 0, + EXT4_FC_SNAP_ERR_ES_MISS, + EXT4_FC_SNAP_ERR_ES_DELAYED, + EXT4_FC_SNAP_ERR_ES_OTHER, + EXT4_FC_SNAP_ERR_INODES_CAP, + EXT4_FC_SNAP_ERR_RANGES_CAP, + EXT4_FC_SNAP_ERR_NOMEM, + EXT4_FC_SNAP_ERR_INODE_LOC, +}; + /* * fourth extended file system inode data in memory */ diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c index 9fc17c1fa7af..c24984d8df83 100644 --- a/fs/ext4/fast_commit.c +++ b/fs/ext4/fast_commit.c @@ -194,6 +194,12 @@ static struct kmem_cache *ext4_fc_range_cachep; #define EXT4_FC_SNAPSHOT_MAX_INODES 1024 #define EXT4_FC_SNAPSHOT_MAX_RANGES 2048 =20 +static inline void ext4_fc_set_snap_err(int *snap_err, int err) +{ + if (snap_err && *snap_err =3D=3D EXT4_FC_SNAP_ERR_NONE) + *snap_err =3D err; +} + static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate) { BUFFER_TRACE(bh, ""); @@ -967,11 +973,12 @@ static void ext4_fc_free_inode_snap(struct inode *ino= de) static int ext4_fc_snapshot_inode_data(struct inode *inode, struct list_head 
*ranges, unsigned int nr_ranges_total, - unsigned int *nr_rangesp) + unsigned int *nr_rangesp, + int *snap_err) { struct ext4_inode_info *ei =3D EXT4_I(inode); - unsigned int nr_ranges =3D 0; ext4_lblk_t start_lblk, end_lblk, cur_lblk; + unsigned int nr_ranges =3D 0; =20 spin_lock(&ei->i_fc_lock); if (ei->i_fc_lblk_len =3D=3D 0) { @@ -996,11 +1003,16 @@ static int ext4_fc_snapshot_inode_data(struct inode = *inode, ext4_lblk_t len; u64 remaining =3D (u64)end_lblk - cur_lblk + 1; =20 - if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) + if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) { + ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS); return -EAGAIN; + } =20 - if (ext4_es_is_delayed(&es)) + if (ext4_es_is_delayed(&es)) { + ext4_fc_set_snap_err(snap_err, + EXT4_FC_SNAP_ERR_ES_DELAYED); return -EAGAIN; + } =20 len =3D es.es_len - (cur_lblk - es.es_lblk); if (len > remaining) @@ -1010,12 +1022,17 @@ static int ext4_fc_snapshot_inode_data(struct inode= *inode, continue; } =20 - if (nr_ranges_total + nr_ranges >=3D EXT4_FC_SNAPSHOT_MAX_RANGES) + if (nr_ranges_total + nr_ranges >=3D EXT4_FC_SNAPSHOT_MAX_RANGES) { + ext4_fc_set_snap_err(snap_err, + EXT4_FC_SNAP_ERR_RANGES_CAP); return -E2BIG; + } =20 range =3D kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS); - if (!range) + if (!range) { + ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM); return -ENOMEM; + } nr_ranges++; =20 range->lblk =3D cur_lblk; @@ -1040,6 +1057,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *= inode, range->len =3D max; } else { kmem_cache_free(ext4_fc_range_cachep, range); + ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER); return -EAGAIN; } =20 @@ -1059,7 +1077,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *= inode, =20 static int ext4_fc_snapshot_inode(struct inode *inode, unsigned int nr_ranges_total, - unsigned int *nr_rangesp) + unsigned int *nr_rangesp, int *snap_err) { struct ext4_inode_info *ei =3D EXT4_I(inode); struct 
ext4_fc_inode_snap *snap; @@ -1071,8 +1089,10 @@ static int ext4_fc_snapshot_inode(struct inode *inod= e, int alloc_ctx; =20 ret =3D ext4_get_inode_loc_noio(inode, &iloc); - if (ret) + if (ret) { + ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC); return ret; + } =20 if (ext4_test_inode_flag(inode, EXT4_INODE_INLINE_DATA)) inode_len =3D EXT4_INODE_SIZE(inode->i_sb); @@ -1081,6 +1101,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode, =20 snap =3D kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS); if (!snap) { + ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM); brelse(iloc.bh); return -ENOMEM; } @@ -1091,7 +1112,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode, brelse(iloc.bh); =20 ret =3D ext4_fc_snapshot_inode_data(inode, &ranges, nr_ranges_total, - &nr_ranges); + &nr_ranges, snap_err); if (ret) { kfree(snap); ext4_fc_free_ranges(&ranges); @@ -1192,7 +1213,10 @@ static int ext4_fc_alloc_snapshot_inodes(struct supe= r_block *sb, unsigned int *nr_inodesp); =20 static int ext4_fc_snapshot_inodes(journal_t *journal, struct inode **inod= es, - unsigned int inodes_size) + unsigned int inodes_size, + unsigned int *nr_inodesp, + unsigned int *nr_rangesp, + int *snap_err) { struct super_block *sb =3D journal->j_private; struct ext4_sb_info *sbi =3D EXT4_SB(sb); @@ -1210,6 +1234,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal= , struct inode **inodes, alloc_ctx =3D ext4_fc_lock(sb); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { if (i >=3D inodes_size) { + ext4_fc_set_snap_err(snap_err, + EXT4_FC_SNAP_ERR_INODES_CAP); ret =3D -E2BIG; goto unlock; } @@ -1233,6 +1259,8 @@ static int ext4_fc_snapshot_inodes(journal_t *journal= , struct inode **inodes, continue; =20 if (i >=3D inodes_size) { + ext4_fc_set_snap_err(snap_err, + EXT4_FC_SNAP_ERR_INODES_CAP); ret =3D -E2BIG; goto unlock; } @@ -1257,16 +1285,20 @@ static int ext4_fc_snapshot_inodes(journal_t *journ= al, struct inode **inodes, unsigned int 
inode_ranges =3D 0; =20 ret =3D ext4_fc_snapshot_inode(inodes[idx], nr_ranges, - &inode_ranges); + &inode_ranges, snap_err); if (ret) break; nr_ranges +=3D inode_ranges; } =20 + if (nr_inodesp) + *nr_inodesp =3D idx; + if (nr_rangesp) + *nr_rangesp =3D nr_ranges; return ret; } =20 -static int ext4_fc_perform_commit(journal_t *journal) +static int ext4_fc_perform_commit(journal_t *journal, tid_t commit_tid) { struct super_block *sb =3D journal->j_private; struct ext4_sb_info *sbi =3D EXT4_SB(sb); @@ -1275,10 +1307,15 @@ static int ext4_fc_perform_commit(journal_t *journa= l) struct inode *inode; struct inode **inodes; unsigned int inodes_size; + unsigned int snap_inodes =3D 0; + unsigned int snap_ranges =3D 0; + int snap_err =3D EXT4_FC_SNAP_ERR_NONE; struct blk_plug plug; int ret =3D 0; u32 crc =3D 0; int alloc_ctx; + ktime_t lock_start; + u64 locked_ns; =20 /* * Step 1: Mark all inodes on s_fc_q[MAIN] with @@ -1326,13 +1363,13 @@ static int ext4_fc_perform_commit(journal_t *journa= l) if (ret) return ret; =20 - ret =3D ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size); if (ret) return ret; =20 /* Step 4: Mark all inodes as being committed. */ jbd2_journal_lock_updates(journal); + lock_start =3D ktime_get(); /* * The journal is now locked. No more handles can start and all the * previous handles are now drained. 
Snapshotting happens in this @@ -1346,8 +1383,15 @@ static int ext4_fc_perform_commit(journal_t *journal) } ext4_fc_unlock(sb, alloc_ctx); =20 - ret =3D ext4_fc_snapshot_inodes(journal, inodes, inodes_size); + ret =3D ext4_fc_snapshot_inodes(journal, inodes, inodes_size, + &snap_inodes, &snap_ranges, &snap_err); jbd2_journal_unlock_updates(journal); + if (trace_ext4_fc_lock_updates_enabled()) { + locked_ns =3D ktime_to_ns(ktime_sub(ktime_get(), lock_start)); + trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns, + snap_inodes, snap_ranges, ret, + snap_err); + } kvfree(inodes); if (ret) return ret; @@ -1552,7 +1596,7 @@ int ext4_fc_commit(journal_t *journal, tid_t commit_t= id) journal_ioprio =3D EXT4_DEF_JOURNAL_IOPRIO; set_task_ioprio(current, journal_ioprio); fc_bufs_before =3D (sbi->s_fc_bytes + bsize - 1) / bsize; - ret =3D ext4_fc_perform_commit(journal); + ret =3D ext4_fc_perform_commit(journal, commit_tid); if (ret < 0) { if (ret =3D=3D -EAGAIN || ret =3D=3D -E2BIG || ret =3D=3D -ECANCELED) status =3D EXT4_FC_STATUS_INELIGIBLE; diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index f493642cf121..7028a28316fa 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -107,6 +107,26 @@ TRACE_DEFINE_ENUM(EXT4_FC_REASON_VERITY); TRACE_DEFINE_ENUM(EXT4_FC_REASON_MOVE_EXT); TRACE_DEFINE_ENUM(EXT4_FC_REASON_MAX); =20 +#undef EM +#undef EMe +#define EM(a) TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a); +#define EMe(a) TRACE_DEFINE_ENUM(EXT4_FC_SNAP_ERR_##a); + +#define TRACE_SNAP_ERR \ + EM(NONE) \ + EM(ES_MISS) \ + EM(ES_DELAYED) \ + EM(ES_OTHER) \ + EM(INODES_CAP) \ + EM(RANGES_CAP) \ + EM(NOMEM) \ + EMe(INODE_LOC) + +TRACE_SNAP_ERR + +#undef EM +#undef EMe + #define show_fc_reason(reason) \ __print_symbolic(reason, \ { EXT4_FC_REASON_XATTR, "XATTR"}, \ @@ -2818,6 +2838,47 @@ TRACE_EVENT(ext4_fc_commit_stop, __entry->num_fc_ineligible, __entry->nblks_agg, __entry->tid) ); =20 +#define EM(a) { EXT4_FC_SNAP_ERR_##a, #a }, +#define 
EMe(a) { EXT4_FC_SNAP_ERR_##a, #a } + +TRACE_EVENT(ext4_fc_lock_updates, + TP_PROTO(struct super_block *sb, tid_t commit_tid, u64 locked_ns, + unsigned int nr_inodes, unsigned int nr_ranges, int err, + int snap_err), + + TP_ARGS(sb, commit_tid, locked_ns, nr_inodes, nr_ranges, err, snap_err), + + TP_STRUCT__entry(/* entry */ + __field(dev_t, dev) + __field(tid_t, tid) + __field(u64, locked_ns) + __field(unsigned int, nr_inodes) + __field(unsigned int, nr_ranges) + __field(int, err) + __field(int, snap_err) + ), + + TP_fast_assign(/* assign */ + __entry->dev =3D sb->s_dev; + __entry->tid =3D commit_tid; + __entry->locked_ns =3D locked_ns; + __entry->nr_inodes =3D nr_inodes; + __entry->nr_ranges =3D nr_ranges; + __entry->err =3D err; + __entry->snap_err =3D snap_err; + ), + + TP_printk("dev %d,%d tid %u locked_ns %llu nr_inodes %u nr_ranges %u err = %d snap_err %s", + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid, + __entry->locked_ns, __entry->nr_inodes, __entry->nr_ranges, + __entry->err, __print_symbolic(__entry->snap_err, + TRACE_SNAP_ERR)) +); + +#undef EM +#undef EMe +#undef TRACE_SNAP_ERR + #define FC_REASON_NAME_STAT(reason) \ show_fc_reason(reason), \ __entry->fc_ineligible_rc[reason] --=20 2.53.0
From: Li Chen
To: Zhang Yi, Theodore Ts'o, Andreas Dilger, Baokun Li, Jan Kara, Ojaswin Mujoo, "Ritesh Harjani (IBM)", Zhang Yi, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, linux-trace-kernel@vger.kernel.org
Subject: [RFC v7 7/7] ext4: fast commit: export snapshot stats in fc_info
Date: Mon, 11 May 2026 16:43:02 +0800
Message-ID: <20260511084304.1559557-8-me@linux.beauty>
In-Reply-To: <20260511084304.1559557-1-me@linux.beauty>
References: <20260511084304.1559557-1-me@linux.beauty>

Snapshot-based fast commit can fall back when the commit-time snapshot
cannot be built (e.g. extent status cache misses). It is useful to
quantify the updates-locked window and to see why snapshotting failed.

Add best-effort snapshot counters to the ext4 superblock and extend
/proc/fs/ext4//fc_info to report the number of snapshotted inodes and
ranges, snapshot failure reasons, and the average/max time spent with
journal updates locked.

Signed-off-by: Li Chen
---
Changes in v7:
- Address Sashiko review by using READ_ONCE() + div64_u64() for the
  fc_info lock_updates average.
Changes in v6: - Start consuming locked_ns in fc_info, so this patch intentionally moves lock_updates_ns_{total,max,samples} accounting here. - Guard the tracepoint call with trace_ext4_fc_lock_updates_enabled() and use trace_call__ext4_fc_lock_updates() to avoid the double static_branch at the guarded call site. - Keep the stats unconditionally while avoiding extra tracepoint overhead when ext4_fc_lock_updates is disabled. fs/ext4/ext4.h | 31 +++++++++++++++++ fs/ext4/fast_commit.c | 78 +++++++++++++++++++++++++++++++++++++------ fs/ext4/super.c | 1 + 3 files changed, 100 insertions(+), 10 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index df30f8705c98..3457b4950c02 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1550,6 +1550,36 @@ struct ext4_orphan_info { * file blocks */ }; =20 +/* + * Ext4 fast commit snapshot statistics. + * + * These are best-effort counters intended for debugging / performance + * introspection; they are not exact under concurrent updates. + */ +struct ext4_fc_snap_stats { + u64 lock_updates_ns_total; + u64 lock_updates_ns_max; + u64 lock_updates_samples; + + u64 snap_inodes; + u64 snap_ranges; + + u64 snap_fail_es_miss; + u64 snap_fail_es_delayed; + u64 snap_fail_es_other; + + u64 snap_fail_inodes_cap; + u64 snap_fail_ranges_cap; + u64 snap_fail_nomem; + u64 snap_fail_inode_loc; + + /* + * Missing inode snapshots during log writing should never happen. + * Keep this counter to help catch unexpected regressions. 
+ */ + u64 snap_fail_no_snap; +}; + /* * fourth extended-fs super-block data in memory */ @@ -1824,6 +1854,7 @@ struct ext4_sb_info { struct mutex s_fc_lock; struct buffer_head *s_fc_bh; struct ext4_fc_stats s_fc_stats; + struct ext4_fc_snap_stats s_fc_snap_stats; tid_t s_fc_ineligible_tid; #ifdef CONFIG_EXT4_DEBUG int s_fc_debug_max_replay; diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c index c24984d8df83..1dfcccf4179e 100644 --- a/fs/ext4/fast_commit.c +++ b/fs/ext4/fast_commit.c @@ -874,13 +874,17 @@ static int ext4_fc_write_inode(struct inode *inode, u= 32 *crc) int inode_len; int ret; =20 - if (!snap) + if (!snap) { + EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++; return -ECANCELED; + } =20 src =3D snap->inode_buf; inode_len =3D snap->inode_len; - if (!src || inode_len =3D=3D 0) + if (!src || inode_len =3D=3D 0) { + EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++; return -ECANCELED; + } =20 fc_inode.fc_ino =3D cpu_to_le32(inode->i_ino); tl.fc_tag =3D cpu_to_le16(EXT4_FC_TAG_INODE); @@ -915,8 +919,10 @@ static int ext4_fc_write_inode_data(struct inode *inod= e, u32 *crc) struct ext4_extent *ex; struct ext4_fc_range *range; =20 - if (!snap) + if (!snap) { + EXT4_SB(inode->i_sb)->s_fc_snap_stats.snap_fail_no_snap++; return -ECANCELED; + } =20 list_for_each_entry(range, &snap->data_list, list) { if (range->tag =3D=3D EXT4_FC_TAG_DEL_RANGE) { @@ -977,6 +983,8 @@ static int ext4_fc_snapshot_inode_data(struct inode *in= ode, int *snap_err) { struct ext4_inode_info *ei =3D EXT4_I(inode); + struct ext4_fc_snap_stats *stats =3D + &EXT4_SB(inode->i_sb)->s_fc_snap_stats; ext4_lblk_t start_lblk, end_lblk, cur_lblk; unsigned int nr_ranges =3D 0; =20 @@ -1004,11 +1012,13 @@ static int ext4_fc_snapshot_inode_data(struct inode= *inode, u64 remaining =3D (u64)end_lblk - cur_lblk + 1; =20 if (!ext4_es_lookup_extent(inode, cur_lblk, NULL, &es, NULL)) { + stats->snap_fail_es_miss++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_MISS); return 
-EAGAIN; } =20 if (ext4_es_is_delayed(&es)) { + stats->snap_fail_es_delayed++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_DELAYED); return -EAGAIN; @@ -1023,6 +1033,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *= inode, } =20 if (nr_ranges_total + nr_ranges >=3D EXT4_FC_SNAPSHOT_MAX_RANGES) { + stats->snap_fail_ranges_cap++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_RANGES_CAP); return -E2BIG; @@ -1030,6 +1041,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *= inode, =20 range =3D kmem_cache_alloc(ext4_fc_range_cachep, GFP_NOFS); if (!range) { + stats->snap_fail_nomem++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM); return -ENOMEM; } @@ -1057,6 +1069,7 @@ static int ext4_fc_snapshot_inode_data(struct inode *= inode, range->len =3D max; } else { kmem_cache_free(ext4_fc_range_cachep, range); + stats->snap_fail_es_other++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_ES_OTHER); return -EAGAIN; } @@ -1080,6 +1093,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode, unsigned int *nr_rangesp, int *snap_err) { struct ext4_inode_info *ei =3D EXT4_I(inode); + struct ext4_fc_snap_stats *stats =3D + &EXT4_SB(inode->i_sb)->s_fc_snap_stats; struct ext4_fc_inode_snap *snap; int inode_len =3D EXT4_GOOD_OLD_INODE_SIZE; struct ext4_iloc iloc; @@ -1090,6 +1105,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode, =20 ret =3D ext4_get_inode_loc_noio(inode, &iloc); if (ret) { + stats->snap_fail_inode_loc++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODE_LOC); return ret; } @@ -1101,6 +1117,7 @@ static int ext4_fc_snapshot_inode(struct inode *inode, =20 snap =3D kmalloc(struct_size(snap, inode_buf, inode_len), GFP_NOFS); if (!snap) { + stats->snap_fail_nomem++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_NOMEM); brelse(iloc.bh); return -ENOMEM; @@ -1125,6 +1142,8 @@ static int ext4_fc_snapshot_inode(struct inode *inode, list_splice_tail_init(&ranges, &snap->data_list); ext4_fc_unlock(inode->i_sb, alloc_ctx); =20 + 
stats->snap_inodes++; + stats->snap_ranges +=3D nr_ranges; if (nr_rangesp) *nr_rangesp =3D nr_ranges; return 0; @@ -1234,6 +1253,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal= , struct inode **inodes, alloc_ctx =3D ext4_fc_lock(sb); list_for_each_entry(iter, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) { if (i >=3D inodes_size) { + sbi->s_fc_snap_stats.snap_fail_inodes_cap++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODES_CAP); ret =3D -E2BIG; @@ -1259,6 +1279,7 @@ static int ext4_fc_snapshot_inodes(journal_t *journal= , struct inode **inodes, continue; =20 if (i >=3D inodes_size) { + sbi->s_fc_snap_stats.snap_fail_inodes_cap++; ext4_fc_set_snap_err(snap_err, EXT4_FC_SNAP_ERR_INODES_CAP); ret =3D -E2BIG; @@ -1302,6 +1323,7 @@ static int ext4_fc_perform_commit(journal_t *journal,= tid_t commit_tid) { struct super_block *sb =3D journal->j_private; struct ext4_sb_info *sbi =3D EXT4_SB(sb); + struct ext4_fc_snap_stats *snap_stats =3D &sbi->s_fc_snap_stats; struct ext4_inode_info *iter; struct ext4_fc_head head; struct inode *inode; @@ -1364,8 +1386,13 @@ static int ext4_fc_perform_commit(journal_t *journal= , tid_t commit_tid) return ret; =20 ret =3D ext4_fc_alloc_snapshot_inodes(sb, &inodes, &inodes_size); - if (ret) + if (ret) { + if (ret =3D=3D -E2BIG) + snap_stats->snap_fail_inodes_cap++; + else if (ret =3D=3D -ENOMEM) + snap_stats->snap_fail_nomem++; return ret; + } =20 /* Step 4: Mark all inodes as being committed. 
*/ jbd2_journal_lock_updates(journal); @@ -1386,12 +1413,15 @@ static int ext4_fc_perform_commit(journal_t *journa= l, tid_t commit_tid) ret =3D ext4_fc_snapshot_inodes(journal, inodes, inodes_size, &snap_inodes, &snap_ranges, &snap_err); jbd2_journal_unlock_updates(journal); - if (trace_ext4_fc_lock_updates_enabled()) { - locked_ns =3D ktime_to_ns(ktime_sub(ktime_get(), lock_start)); - trace_ext4_fc_lock_updates(sb, commit_tid, locked_ns, - snap_inodes, snap_ranges, ret, - snap_err); - } + locked_ns =3D ktime_to_ns(ktime_sub(ktime_get(), lock_start)); + snap_stats->lock_updates_ns_total +=3D locked_ns; + snap_stats->lock_updates_samples++; + if (locked_ns > snap_stats->lock_updates_ns_max) + snap_stats->lock_updates_ns_max =3D locked_ns; + if (trace_ext4_fc_lock_updates_enabled()) + trace_call__ext4_fc_lock_updates(sb, commit_tid, locked_ns, + snap_inodes, snap_ranges, + ret, snap_err); kvfree(inodes); if (ret) return ret; @@ -2667,11 +2697,23 @@ int ext4_fc_info_show(struct seq_file *seq, void *v) { struct ext4_sb_info *sbi =3D EXT4_SB((struct super_block *)seq->private); struct ext4_fc_stats *stats =3D &sbi->s_fc_stats; + struct ext4_fc_snap_stats *snap_stats =3D &sbi->s_fc_snap_stats; + u64 lock_avg_ns =3D 0; + u64 lock_updates_samples; + u64 lock_updates_ns_total; + u64 lock_updates_ns_max; int i; =20 if (v !=3D SEQ_START_TOKEN) return 0; =20 + lock_updates_samples =3D READ_ONCE(snap_stats->lock_updates_samples); + lock_updates_ns_total =3D READ_ONCE(snap_stats->lock_updates_ns_total); + lock_updates_ns_max =3D READ_ONCE(snap_stats->lock_updates_ns_max); + if (lock_updates_samples) + lock_avg_ns =3D div64_u64(lock_updates_ns_total, + lock_updates_samples); + seq_printf(seq, "fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_= time\n", stats->fc_num_commits, stats->fc_ineligible_commits, @@ -2682,6 +2724,22 @@ int ext4_fc_info_show(struct seq_file *seq, void *v) seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i], 
stats->fc_ineligible_reason_count[i]); =20 + seq_printf(seq, + "Snapshot stats:\n%llu inodes\n%llu ranges\n%lluus lock_updates_avg\n= %lluus lock_updates_max\n", + snap_stats->snap_inodes, snap_stats->snap_ranges, + div_u64(lock_avg_ns, 1000), + div_u64(lock_updates_ns_max, 1000)); + seq_printf(seq, + "Snapshot failures:\n%llu es_miss\n%llu es_delayed\n%llu es_other\n%l= lu inodes_cap\n%llu ranges_cap\n%llu nomem\n%llu inode_loc\n%llu no_snap\n", + snap_stats->snap_fail_es_miss, + snap_stats->snap_fail_es_delayed, + snap_stats->snap_fail_es_other, + snap_stats->snap_fail_inodes_cap, + snap_stats->snap_fail_ranges_cap, + snap_stats->snap_fail_nomem, + snap_stats->snap_fail_inode_loc, + snap_stats->snap_fail_no_snap); + return 0; } =20 diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 3c869f0001c5..f1f8819a2a23 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -4544,6 +4544,7 @@ static void ext4_fast_commit_init(struct super_block = *sb) sbi->s_fc_ineligible_tid =3D 0; mutex_init(&sbi->s_fc_lock); memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats)); + memset(&sbi->s_fc_snap_stats, 0, sizeof(sbi->s_fc_snap_stats)); sbi->s_fc_replay_state.fc_regions =3D NULL; sbi->s_fc_replay_state.fc_regions_size =3D 0; sbi->s_fc_replay_state.fc_regions_used =3D 0; --=20 2.53.0
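For readers of patch 7/7, the lock_updates average and maximum printed in fc_info can be modeled in plain userspace C. This is a sketch under stated assumptions rather than the kernel implementation: `struct lock_stats` and the three helpers are hypothetical names, ordinary 64-bit division stands in for div64_u64() and div_u64(), and there is no READ_ONCE() because the model is single-threaded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the lock_updates_* fields in the patch. */
struct lock_stats {
	uint64_t ns_total;	/* sum of all locked windows, in ns */
	uint64_t ns_max;	/* longest locked window, in ns */
	uint64_t samples;	/* number of windows recorded */
};

/* Account one updates-locked window of locked_ns nanoseconds. */
static void lock_stats_record(struct lock_stats *s, uint64_t locked_ns)
{
	assert(s);
	s->ns_total += locked_ns;
	s->samples++;
	if (locked_ns > s->ns_max)
		s->ns_max = locked_ns;
}

/* Average window in microseconds; 0 when nothing was recorded yet. */
static uint64_t lock_stats_avg_us(const struct lock_stats *s)
{
	if (!s->samples)
		return 0;
	return (s->ns_total / s->samples) / 1000;
}

/* Longest window in microseconds. */
static uint64_t lock_stats_max_us(const struct lock_stats *s)
{
	return s->ns_max / 1000;
}
```

The zero-sample guard matters because fc_info can be read before any fast commit has run. As in the patch, the average is computed in nanoseconds first and only then scaled to microseconds.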