From: Zhang Yi <yi.zhang@huawei.com>
To: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
	ojaswin@linux.ibm.com, ritesh.list@gmail.com, hch@infradead.org,
	djwong@kernel.org, yi.zhang@huawei.com, yi.zhang@huaweicloud.com,
	yizhang089@gmail.com, libaokun1@huawei.com, yangerkun@huawei.com,
	yukuai@fnnas.com
Subject: [PATCH -next v2 12/22] ext4: implement buffered write iomap path
Date: Tue, 3 Feb 2026 14:25:12 +0800
Message-ID: <20260203062523.3869120-13-yi.zhang@huawei.com>
In-Reply-To: <20260203062523.3869120-1-yi.zhang@huawei.com>
References: <20260203062523.3869120-1-yi.zhang@huawei.com>

Introduce two new iomap_ops instances, ext4_iomap_buffered_write_ops
and ext4_iomap_buffered_da_write_ops, to implement the iomap write
paths for ext4. ext4_iomap_buffered_da_write_begin() invokes
ext4_da_map_blocks() to map delayed allocation extents, and
ext4_iomap_buffered_write_begin() invokes ext4_iomap_get_blocks() to
allocate blocks directly in non-delayed allocation mode. Additionally,
add ext4_iomap_valid() so that the iomap infrastructure can check the
validity of cached extent mappings (see the sketch after the key notes
below).

Key notes:

- Since we don't use ordered data mode to prevent exposing stale data
  in the non-delayed allocation path, we ignore the dioread_nolock
  mount option and always allocate unwritten extents for new blocks.

- The iomap write path maps multiple blocks at a time in the
  iomap_begin() callbacks, so we must remove the stale delayed
  allocation range on short writes and write failures. Otherwise, a
  range of delayed extents could end up covered by a clean folio,
  which would lead to inaccurate space reservation.

- The ordering of the folio lock and the transaction start is the
  opposite of that in the buffer_head buffered write path; update the
  locking documentation accordingly.
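
To make the revalidation contract concrete: ->iomap_begin() snapshots
the extent status sequence counter into iomap->validity_cookie, and the
iomap core calls ->iomap_valid() after it has locked the folio,
remapping and retrying if the cookie no longer matches. The userspace
toy below is only a sketch of that sequence-cookie pattern; es_seq,
toy_map_blocks() and toy_mapping_valid() are made-up stand-ins for
i_es_seq, the ->iomap_begin() callbacks and ext4_iomap_valid():

/*
 * Userspace toy of the sequence-cookie pattern (illustrative only,
 * compile with -std=c11): a writer bumps a sequence counter whenever
 * the extent cache changes; a reader snapshots the counter together
 * with its mapping and revalidates after taking the folio lock.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_uint es_seq;	/* stand-in for EXT4_I(inode)->i_es_seq */

struct toy_mapping {
	unsigned int validity_cookie;	/* snapshot of es_seq */
	long block;
};

static struct toy_mapping toy_map_blocks(long block)
{
	/* like ->iomap_begin(): sample the sequence with the mapping */
	return (struct toy_mapping){ atomic_load(&es_seq), block };
}

static bool toy_mapping_valid(const struct toy_mapping *m)
{
	/* like ext4_iomap_valid(): stale if the cache changed meanwhile */
	return m->validity_cookie == atomic_load(&es_seq);
}

int main(void)
{
	struct toy_mapping m = toy_map_blocks(42);

	atomic_fetch_add(&es_seq, 1);	/* extent cache changed under us */
	printf("mapping is %s\n",
	       toy_mapping_valid(&m) ? "valid" : "stale, remap and retry");
	return 0;
}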
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h  |   4 ++
 fs/ext4/file.c  |  20 +++++-
 fs/ext4/inode.c | 173 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/ext4/super.c |  10 ++-
 4 files changed, 200 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4930446cfec1..89059b15ee5c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3062,6 +3062,7 @@ int ext4_walk_page_buffers(handle_t *handle,
 int do_journal_get_write_access(handle_t *handle, struct inode *inode,
 				struct buffer_head *bh);
 void ext4_set_inode_mapping_order(struct inode *inode);
+int ext4_nonda_switch(struct super_block *sb);
 #define FALL_BACK_TO_NONDELALLOC 1
 #define CONVERT_INLINE_DATA 2
 
@@ -3930,6 +3931,9 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
 
 extern const struct iomap_ops ext4_iomap_ops;
 extern const struct iomap_ops ext4_iomap_report_ops;
+extern const struct iomap_ops ext4_iomap_buffered_write_ops;
+extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;
+extern const struct iomap_write_ops ext4_iomap_write_ops;
 
 static inline int ext4_buffer_uptodate(struct buffer_head *bh)
 {
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3ecc09f286e4..11fbc607d332 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -303,6 +303,21 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
 	return count;
 }
 
+static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
+					 struct iov_iter *from)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	const struct iomap_ops *iomap_ops;
+
+	if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+		iomap_ops = &ext4_iomap_buffered_da_write_ops;
+	else
+		iomap_ops = &ext4_iomap_buffered_write_ops;
+
+	return iomap_file_buffered_write(iocb, from, iomap_ops,
+					 &ext4_iomap_write_ops, NULL);
+}
+
 static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 					struct iov_iter *from)
 {
@@ -317,7 +332,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (ret <= 0)
 		goto out;
 
-	ret = generic_perform_write(iocb, from);
+	if (ext4_inode_buffered_iomap(inode))
+		ret = ext4_iomap_buffered_write(iocb, from);
+	else
+		ret = generic_perform_write(iocb, from);
 
 out:
 	inode_unlock(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c9489978358e..da4fd62c6963 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3065,7 +3065,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
 	return ret;
 }
 
-static int ext4_nonda_switch(struct super_block *sb)
+int ext4_nonda_switch(struct super_block *sb)
 {
 	s64 free_clusters, dirty_clusters;
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3462,6 +3462,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
 	return inode_state_read_once(inode) & I_DIRTY_DATASYNC;
 }
 
+static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
+{
+	return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
+}
+
+const struct iomap_write_ops ext4_iomap_write_ops = {
+	.iomap_valid = ext4_iomap_valid,
+};
+
 static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 			   struct ext4_map_blocks *map, loff_t offset,
 			   loff_t length, unsigned int flags)
@@ -3496,6 +3505,8 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		iomap->flags |= IOMAP_F_MERGED;
 
+	iomap->validity_cookie = map->m_seq;
+
 	/*
 	 * Flags passed to ext4_map_blocks() for direct I/O writes can result
 	 * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
@@ -3903,8 +3914,12 @@ const struct iomap_ops ext4_iomap_report_ops = {
 	.iomap_begin = ext4_iomap_begin_report,
 };
 
+/* Map blocks */
+typedef int (ext4_get_blocks_t)(struct inode *, struct ext4_map_blocks *);
+
 static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
-				 loff_t length, struct ext4_map_blocks *map)
+				 loff_t length, ext4_get_blocks_t get_blocks,
+				 struct ext4_map_blocks *map)
 {
 	u8 blkbits = inode->i_blkbits;
 
@@ -3916,6 +3931,9 @@ static int ext4_iomap_map_blocks(struct inode *inode, loff_t offset,
 	map->m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
 			   EXT4_MAX_LOGICAL_BLOCK) - map->m_lblk + 1;
 
+	if (get_blocks)
+		return get_blocks(inode, map);
+
 	return ext4_map_blocks(NULL, inode, map, 0);
 }
 
@@ -3933,7 +3951,91 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
 	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
 		return -ERANGE;
 
-	ret = ext4_iomap_map_blocks(inode, offset, length, &map);
+	ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
+	if (ret < 0)
+		return ret;
+
+	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
+	return 0;
+}
+
+static int ext4_iomap_get_blocks(struct inode *inode,
+				 struct ext4_map_blocks *map)
+{
+	loff_t i_size = i_size_read(inode);
+	handle_t *handle;
+	int ret, needed_blocks;
+
+	/*
+	 * Check whether the blocks have already been allocated; if so,
+	 * we can avoid starting a new journal transaction and return
+	 * the mapping information directly.
+	 */
+	if ((map->m_lblk + map->m_len) <=
+	    round_up(i_size, i_blocksize(inode)) >> inode->i_blkbits) {
+		ret = ext4_map_blocks(NULL, inode, map, 0);
+		if (ret < 0)
+			return ret;
+		if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN |
+				    EXT4_MAP_DELAYED))
+			return 0;
+	}
+
+	/*
+	 * Reserve one extra block for adding the inode to the orphan
+	 * list in case we allocate blocks but the write fails for some
+	 * reason.
+	 */
+	needed_blocks = ext4_chunk_trans_blocks(inode, map->m_len) + 1;
+	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	ret = ext4_map_blocks(handle, inode, map,
+			      EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+	/*
+	 * We have to stop the handle here for two reasons.
+	 *
+	 * - One is a potential deadlock caused by the subsequent call to
+	 *   balance_dirty_pages(). It may wait for dirty pages to be
+	 *   written back, which could start another handle and cause it
+	 *   to wait for the current one to complete.
+	 *
+	 * - The other is that we cannot hold the folio lock under an
+	 *   active handle: iomap's lock ordering always acquires the
+	 *   folio lock before starting a new handle, so doing the
+	 *   opposite here could cause a deadlock.
+	 */
+	ext4_journal_stop(handle);
+
+	return ret;
+}
+
+static int ext4_iomap_buffered_do_write_begin(struct inode *inode,
+		loff_t offset, loff_t length, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap, bool delalloc)
+{
+	int ret, retries = 0;
+	struct ext4_map_blocks map;
+	ext4_get_blocks_t *get_blocks;
+
+	ret = ext4_emergency_state(inode->i_sb);
+	if (unlikely(ret))
+		return ret;
+
+	/* Inline data support is not yet available. */
+	if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+		return -ERANGE;
+	if (WARN_ON_ONCE(!(flags & IOMAP_WRITE)))
+		return -EINVAL;
+
+	if (delalloc)
+		get_blocks = ext4_da_map_blocks;
+	else
+		get_blocks = ext4_iomap_get_blocks;
+retry:
+	ret = ext4_iomap_map_blocks(inode, offset, length, get_blocks, &map);
+	if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+		goto retry;
 	if (ret < 0)
 		return ret;
 
@@ -3941,6 +4043,71 @@ static int ext4_iomap_buffered_read_begin(struct inode *inode, loff_t offset,
 	return 0;
 }
 
+static int ext4_iomap_buffered_write_begin(struct inode *inode,
+		loff_t offset, loff_t length, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap)
+{
+	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
+						  iomap, srcmap, false);
+}
+
+static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
+		loff_t offset, loff_t length, unsigned int flags,
+		struct iomap *iomap, struct iomap *srcmap)
+{
+	return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
+						  iomap, srcmap, true);
+}
+
+/*
+ * Drop the stale delayed allocation range left over from a short write
+ * or write failure, including both the start and end blocks. If we do
+ * not, we could leave a range of delayed extents covered by a clean
+ * folio, which could lead to inaccurate space reservation.
+ */
+static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
+				      loff_t length, struct iomap *iomap)
+{
+	down_write(&EXT4_I(inode)->i_data_sem);
+	ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
+			DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
+	up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
+					    loff_t length, ssize_t written,
+					    unsigned int flags,
+					    struct iomap *iomap)
+{
+	loff_t start_byte, end_byte;
+
+	/* If we didn't reserve the blocks, we're not allowed to punch them. */
+	if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
+		return 0;
+
+	/* Nothing to do if we've written the entire delalloc extent */
+	start_byte = iomap_last_written_block(inode, offset, written);
+	end_byte = round_up(offset + length, i_blocksize(inode));
+	if (start_byte >= end_byte)
+		return 0;
+
+	filemap_invalidate_lock(inode->i_mapping);
+	iomap_write_delalloc_release(inode, start_byte, end_byte, flags,
+				     iomap, ext4_iomap_punch_delalloc);
+	filemap_invalidate_unlock(inode->i_mapping);
+	return 0;
+}
+
+const struct iomap_ops ext4_iomap_buffered_write_ops = {
+	.iomap_begin = ext4_iomap_buffered_write_begin,
+};
+
+const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
+	.iomap_begin = ext4_iomap_buffered_da_write_begin,
+	.iomap_end = ext4_iomap_buffered_da_write_end,
+};
+
 const struct iomap_ops ext4_iomap_buffered_read_ops = {
 	.iomap_begin = ext4_iomap_buffered_read_begin,
 };
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 69eb63dde983..b68509505558 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -104,9 +104,13 @@ static const struct fs_parameter_spec ext4_param_specs[];
  *   -> page lock -> i_data_sem (rw)
  *
  * buffered write path:
- * sb_start_write -> i_mutex -> mmap_lock
- * sb_start_write -> i_mutex -> transaction start -> page lock ->
- *   i_data_sem (rw)
+ * sb_start_write -> i_rwsem (w) -> mmap_lock
+ * - buffer_head path:
+ *   sb_start_write -> i_rwsem (w) -> transaction start -> folio lock ->
+ *     i_data_sem (rw)
+ * - iomap path:
+ *   sb_start_write -> i_rwsem (w) -> transaction start -> i_data_sem (rw)
+ *   sb_start_write -> i_rwsem (w) -> folio lock
 *
 * truncate:
 * sb_start_write -> i_mutex -> invalidate_lock (w) -> i_mmap_rwsem (w) ->
-- 
2.52.0
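
As an aside for reviewers of the short-write case: the byte rounding in
ext4_iomap_buffered_da_write_end() can be sanity-checked with the
userspace toy below. It assumes iomap_last_written_block() rounds
offset + written up to the next block boundary, which is how the
function uses it; toy_punch_range() and BLKSIZE are made up for
illustration.

/*
 * Userspace toy of the punch-range arithmetic (illustrative only):
 * everything from the first block boundary past the bytes actually
 * copied, up to the end of the delalloc mapping handed out by
 * ->iomap_begin(), must have its reservation dropped. ROUND_UP here
 * mirrors the kernel macro for power-of-two sizes.
 */
#include <stdio.h>

#define BLKSIZE 4096L
#define ROUND_UP(x, sz) (((x) + (sz) - 1) & ~((sz) - 1))

static void toy_punch_range(long offset, long length, long written)
{
	/* plays the role of iomap_last_written_block() */
	long start_byte = ROUND_UP(offset + written, BLKSIZE);
	/* end of the delalloc extent mapped by ->iomap_begin() */
	long end_byte = ROUND_UP(offset + length, BLKSIZE);

	if (start_byte >= end_byte)
		printf("written in full: nothing to punch\n");
	else
		printf("punch stale delalloc bytes [%ld, %ld)\n",
		       start_byte, end_byte);
}

int main(void)
{
	toy_punch_range(1000, 20000, 20000);	/* complete write */
	toy_punch_range(1000, 20000, 5000);	/* short write */
	return 0;
}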