From: Zhang Yi
To: linux-ext4@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz,
	yi.zhang@huawei.com, yi.zhang@huaweicloud.com, libaokun1@huawei.com,
	yukuai3@huawei.com, yangerkun@huawei.com
Subject: [PATCH 11/13] ext4: switch to using the new extent movement method
Date: Tue, 23 Sep 2025 09:27:21 +0800
Message-ID: <20250923012724.2378858-12-yi.zhang@huaweicloud.com>
X-Mailer: git-send-email 2.46.1
In-Reply-To: <20250923012724.2378858-1-yi.zhang@huaweicloud.com>
References: <20250923012724.2378858-1-yi.zhang@huaweicloud.com>

From: Zhang Yi

Now that we have mext_move_extent(), we can switch ext4_move_extents()
to this new interface and remove the old move_extent_per_page() helper.
First, after acquiring the i_rwsem, we can directly use
ext4_map_blocks() to look up a contiguous extent in the original inode
as the extent to be moved. It is safe to take the mapping information
from the extent status tree without reading the on-disk extent tree,
because mext_move_extent() rechecks the sequence cookie under the folio
lock. Then, after populating the mext_data structure, we call
mext_move_extent() to move the extent. Finally, the length of the
extent to be moved is passed in mext.orig_map.m_len, and the length
that was actually moved is returned through m_len.

Signed-off-by: Zhang Yi
---
 fs/ext4/move_extent.c | 386 +++++-------------------------------------
 1 file changed, 42 insertions(+), 344 deletions(-)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 4edb9a378db7..b478631e243c 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -20,29 +20,6 @@ struct mext_data {
 	ext4_lblk_t donor_lblk;	/* Start block of the donor file */
 };
 
-/**
- * get_ext_path() - Find an extent path for designated logical block number.
- * @inode: inode to be searched
- * @lblock: logical block number to find an extent path
- * @path: pointer to an extent path
- *
- * ext4_find_extent wrapper. Return an extent path pointer on success,
- * or an error pointer on failure.
- */
-static inline struct ext4_ext_path *
-get_ext_path(struct inode *inode, ext4_lblk_t lblock,
-	     struct ext4_ext_path *path)
-{
-	path = ext4_find_extent(inode, lblock, path, EXT4_EX_NOCACHE);
-	if (IS_ERR(path))
-		return path;
-	if (path[ext_depth(inode)].p_ext == NULL) {
-		ext4_free_ext_path(path);
-		return ERR_PTR(-ENODATA);
-	}
-	return path;
-}
-
 /**
  * ext4_double_down_write_data_sem() - write lock two inodes's i_data_sem
  * @first: inode to be locked
@@ -59,7 +36,6 @@ ext4_double_down_write_data_sem(struct inode *first, struct inode *second)
 	} else {
 		down_write(&EXT4_I(second)->i_data_sem);
 		down_write_nested(&EXT4_I(first)->i_data_sem, I_DATA_SEM_OTHER);
-
 	}
 }
 
@@ -78,42 +54,6 @@ ext4_double_up_write_data_sem(struct inode *orig_inode,
 	up_write(&EXT4_I(donor_inode)->i_data_sem);
 }
 
-/**
- * mext_check_coverage - Check that all extents in range has the same type
- *
- * @inode:	inode in question
- * @from:	block offset of inode
- * @count:	block count to be checked
- * @unwritten:	extents expected to be unwritten
- * @err:	pointer to save error value
- *
- * Return 1 if all extents in range has expected type, and zero otherwise.
- */
-static int
-mext_check_coverage(struct inode *inode, ext4_lblk_t from, ext4_lblk_t count,
-		    int unwritten, int *err)
-{
-	struct ext4_ext_path *path = NULL;
-	struct ext4_extent *ext;
-	int ret = 0;
-	ext4_lblk_t last = from + count;
-	while (from < last) {
-		path = get_ext_path(inode, from, path);
-		if (IS_ERR(path)) {
-			*err = PTR_ERR(path);
-			return ret;
-		}
-		ext = path[ext_depth(inode)].p_ext;
-		if (unwritten != ext4_ext_is_unwritten(ext))
-			goto out;
-		from += ext4_ext_get_actual_len(ext);
-	}
-	ret = 1;
-out:
-	ext4_free_ext_path(path);
-	return ret;
-}
-
 /**
  * mext_folio_double_lock - Grab and lock folio on both @inode1 and @inode2
  *
@@ -363,7 +303,7 @@ static int mext_folio_mkwrite(struct inode *inode, struct folio *folio,
  * the replaced block count through m_len. Return 0 on success, and an error
  * code otherwise.
  */
-static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
+static int mext_move_extent(struct mext_data *mext, u64 *m_len)
 {
 	struct inode *orig_inode = mext->orig_inode;
 	struct inode *donor_inode = mext->donor_inode;
@@ -454,210 +394,6 @@ static __used int mext_move_extent(struct mext_data *mext, u64 *m_len)
 	goto unlock;
 }
 
-/**
- * move_extent_per_page - Move extent data per page
- *
- * @o_filp:			file structure of original file
- * @donor_inode:		donor inode
- * @orig_page_offset:		page index on original file
- * @donor_page_offset:		page index on donor file
- * @data_offset_in_page:	block index where data swapping starts
- * @block_len_in_page:		the number of blocks to be swapped
- * @unwritten:			orig extent is unwritten or not
- * @err:			pointer to save return value
- *
- * Save the data in original inode blocks and replace original inode extents
- * with donor inode extents by calling ext4_swap_extents().
- * Finally, write out the saved data in new original inode blocks. Return
- * replaced block count.
- */
-static int
-move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
-		     pgoff_t orig_page_offset, pgoff_t donor_page_offset,
-		     int data_offset_in_page,
-		     int block_len_in_page, int unwritten, int *err)
-{
-	struct inode *orig_inode = file_inode(o_filp);
-	struct folio *folio[2] = {NULL, NULL};
-	handle_t *handle;
-	ext4_lblk_t orig_blk_offset, donor_blk_offset;
-	unsigned long blocksize = orig_inode->i_sb->s_blocksize;
-	unsigned int tmp_data_size, data_size, replaced_size;
-	int i, err2, jblocks, retries = 0;
-	int replaced_count = 0;
-	int from;
-	int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
-	struct super_block *sb = orig_inode->i_sb;
-	struct buffer_head *bh = NULL;
-
-	/*
-	 * It needs twice the amount of ordinary journal buffers because
-	 * inode and donor_inode may change each different metadata blocks.
-	 */
-again:
-	*err = 0;
-	jblocks = ext4_meta_trans_blocks(orig_inode, block_len_in_page,
-					 block_len_in_page) * 2;
-	handle = ext4_journal_start(orig_inode, EXT4_HT_MOVE_EXTENTS, jblocks);
-	if (IS_ERR(handle)) {
-		*err = PTR_ERR(handle);
-		return 0;
-	}
-
-	orig_blk_offset = orig_page_offset * blocks_per_page +
-		data_offset_in_page;
-
-	donor_blk_offset = donor_page_offset * blocks_per_page +
-		data_offset_in_page;
-
-	/* Calculate data_size */
-	if ((orig_blk_offset + block_len_in_page - 1) ==
-	    ((orig_inode->i_size - 1) >> orig_inode->i_blkbits)) {
-		/* Replace the last block */
-		tmp_data_size = orig_inode->i_size & (blocksize - 1);
-		/*
-		 * If data_size equal zero, it shows data_size is multiples of
-		 * blocksize. So we set appropriate value.
-		 */
-		if (tmp_data_size == 0)
-			tmp_data_size = blocksize;
-
-		data_size = tmp_data_size +
-			((block_len_in_page - 1) << orig_inode->i_blkbits);
-	} else
-		data_size = block_len_in_page << orig_inode->i_blkbits;
-
-	replaced_size = data_size;
-
-	*err = mext_folio_double_lock(orig_inode, donor_inode, orig_page_offset,
-				      donor_page_offset, folio);
-	if (unlikely(*err < 0))
-		goto stop_journal;
-	/*
-	 * If orig extent was unwritten it can become initialized
-	 * at any time after i_data_sem was dropped, in order to
-	 * serialize with delalloc we have recheck extent while we
-	 * hold page's lock, if it is still the case data copy is not
-	 * necessary, just swap data blocks between orig and donor.
-	 */
-	if (unwritten) {
-		ext4_double_down_write_data_sem(orig_inode, donor_inode);
-		/* If any of extents in range became initialized we have to
-		 * fallback to data copying */
-		unwritten = mext_check_coverage(orig_inode, orig_blk_offset,
-						block_len_in_page, 1, err);
-		if (*err)
-			goto drop_data_sem;
-
-		unwritten &= mext_check_coverage(donor_inode, donor_blk_offset,
-						 block_len_in_page, 1, err);
-		if (*err)
-			goto drop_data_sem;
-
-		if (!unwritten) {
-			ext4_double_up_write_data_sem(orig_inode, donor_inode);
-			goto data_copy;
-		}
-		if (!filemap_release_folio(folio[0], 0) ||
-		    !filemap_release_folio(folio[1], 0)) {
-			*err = -EBUSY;
-			goto drop_data_sem;
-		}
-		replaced_count = ext4_swap_extents(handle, orig_inode,
-						   donor_inode, orig_blk_offset,
-						   donor_blk_offset,
-						   block_len_in_page, 1, err);
-	drop_data_sem:
-		ext4_double_up_write_data_sem(orig_inode, donor_inode);
-		goto unlock_folios;
-	}
-data_copy:
-	from = offset_in_folio(folio[0],
-			       orig_blk_offset << orig_inode->i_blkbits);
-	*err = mext_folio_mkuptodate(folio[0], from, from + replaced_size);
-	if (*err)
-		goto unlock_folios;
-
-	/* At this point all buffers in range are uptodate, old mapping layout
-	 * is no longer required, try to drop it now. */
-	if (!filemap_release_folio(folio[0], 0) ||
-	    !filemap_release_folio(folio[1], 0)) {
-		*err = -EBUSY;
-		goto unlock_folios;
-	}
-	ext4_double_down_write_data_sem(orig_inode, donor_inode);
-	replaced_count = ext4_swap_extents(handle, orig_inode, donor_inode,
-					   orig_blk_offset, donor_blk_offset,
-					   block_len_in_page, 1, err);
-	ext4_double_up_write_data_sem(orig_inode, donor_inode);
-	if (*err) {
-		if (replaced_count) {
-			block_len_in_page = replaced_count;
-			replaced_size =
-				block_len_in_page << orig_inode->i_blkbits;
-		} else
-			goto unlock_folios;
-	}
-	/* Perform all necessary steps similar write_begin()/write_end()
-	 * but keeping in mind that i_size will not change */
-	bh = folio_buffers(folio[0]);
-	if (!bh)
-		bh = create_empty_buffers(folio[0],
-				1 << orig_inode->i_blkbits, 0);
-	for (i = 0; i < from >> orig_inode->i_blkbits; i++)
-		bh = bh->b_this_page;
-	for (i = 0; i < block_len_in_page; i++) {
-		*err = ext4_get_block(orig_inode, orig_blk_offset + i, bh, 0);
-		if (*err < 0)
-			goto repair_branches;
-		bh = bh->b_this_page;
-	}
-
-	block_commit_write(folio[0], from, from + replaced_size);
-
-	/* Even in case of data=writeback it is reasonable to pin
-	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_inode_add_write(handle, orig_inode,
-			(loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);
-
-unlock_folios:
-	folio_unlock(folio[0]);
-	folio_put(folio[0]);
-	folio_unlock(folio[1]);
-	folio_put(folio[1]);
-stop_journal:
-	ext4_journal_stop(handle);
-	if (*err == -ENOSPC &&
-	    ext4_should_retry_alloc(sb, &retries))
-		goto again;
-	/* Buffer was busy because probably is pinned to journal transaction,
-	 * force transaction commit may help to free it. */
-	if (*err == -EBUSY && retries++ < 4 && EXT4_SB(sb)->s_journal &&
-	    jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal))
-		goto again;
-	return replaced_count;
-
-repair_branches:
-	/*
-	 * This should never ever happen!
-	 * Extents are swapped already, but we are not able to copy data.
-	 * Try to swap extents to it's original places
-	 */
-	ext4_double_down_write_data_sem(orig_inode, donor_inode);
-	replaced_count = ext4_swap_extents(handle, donor_inode, orig_inode,
-					   orig_blk_offset, donor_blk_offset,
-					   block_len_in_page, 0, &err2);
-	ext4_double_up_write_data_sem(orig_inode, donor_inode);
-	if (replaced_count != block_len_in_page) {
-		ext4_error_inode_block(orig_inode, (sector_t)(orig_blk_offset),
-				       EIO, "Unable to copy data block,"
-				       " data will be lost.");
-		*err = -EIO;
-	}
-	replaced_count = 0;
-	goto unlock_folios;
-}
-
 /*
  * Check the validity of the basic filesystem environment and the
  * inodes' support status.
@@ -819,106 +555,72 @@ static int mext_check_adjust_range(struct inode *orig_inode,
  *
  * This function returns 0 and moved block length is set in moved_len
  * if succeed, otherwise returns error value.
- *
  */
-int
-ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
-		__u64 donor_blk, __u64 len, __u64 *moved_len)
+int ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
+		      __u64 donor_blk, __u64 len, __u64 *moved_len)
 {
 	struct inode *orig_inode = file_inode(o_filp);
 	struct inode *donor_inode = file_inode(d_filp);
-	struct ext4_ext_path *path = NULL;
-	int blocks_per_page = PAGE_SIZE >> orig_inode->i_blkbits;
-	ext4_lblk_t o_end, o_start = orig_blk;
-	ext4_lblk_t d_start = donor_blk;
+	struct mext_data mext;
+	struct super_block *sb = orig_inode->i_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	int retries = 0;
+	u64 m_len;
 	int ret;
 
+	*moved_len = 0;
+
 	/* Protect orig and donor inodes against a truncate */
 	lock_two_nondirectories(orig_inode, donor_inode);
 
 	ret = mext_check_validity(orig_inode, donor_inode);
 	if (ret)
-		goto unlock;
+		goto out;
 
 	/* Wait for all existing dio workers */
 	inode_dio_wait(orig_inode);
 	inode_dio_wait(donor_inode);
 
-	/* Protect extent tree against block allocations via delalloc */
-	ext4_double_down_write_data_sem(orig_inode, donor_inode);
 	/* Check and adjust the specified move_extent range. */
 	ret = mext_check_adjust_range(orig_inode, donor_inode, orig_blk,
 				      donor_blk, &len);
 	if (ret)
 		goto out;
-	o_end = o_start + len;
 
-	*moved_len = 0;
-	while (o_start < o_end) {
-		struct ext4_extent *ex;
-		ext4_lblk_t cur_blk, next_blk;
-		pgoff_t orig_page_index, donor_page_index;
-		int offset_in_page;
-		int unwritten, cur_len;
-
-		path = get_ext_path(orig_inode, o_start, path);
-		if (IS_ERR(path)) {
-			ret = PTR_ERR(path);
+	mext.orig_inode = orig_inode;
+	mext.donor_inode = donor_inode;
+	while (len) {
+		mext.orig_map.m_lblk = orig_blk;
+		mext.orig_map.m_len = len;
+		mext.orig_map.m_flags = 0;
+		mext.donor_lblk = donor_blk;
+
+		ret = ext4_map_blocks(NULL, orig_inode, &mext.orig_map, 0);
+		if (ret < 0)
 			goto out;
-		}
-		ex = path[path->p_depth].p_ext;
-		cur_blk = le32_to_cpu(ex->ee_block);
-		cur_len = ext4_ext_get_actual_len(ex);
-		/* Check hole before the start pos */
-		if (cur_blk + cur_len - 1 < o_start) {
-			next_blk = ext4_ext_next_allocated_block(path);
-			if (next_blk == EXT_MAX_BLOCKS) {
-				ret = -ENODATA;
-				goto out;
-			}
-			d_start += next_blk - o_start;
-			o_start = next_blk;
-			continue;
-		/* Check hole after the start pos */
-		} else if (cur_blk > o_start) {
-			/* Skip hole */
-			d_start += cur_blk - o_start;
-			o_start = cur_blk;
-			/* Extent inside requested range ?*/
-			if (cur_blk >= o_end)
+
+		/* Skip moving if it is a hole or a delalloc extent. */
+		if (mext.orig_map.m_flags &
+		    (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
+			ret = mext_move_extent(&mext, &m_len);
+			if (ret == -ESTALE)
+				continue;
+			if (ret == -ENOSPC &&
+			    ext4_should_retry_alloc(sb, &retries))
+				continue;
+			if (ret == -EBUSY &&
+			    sbi->s_journal && retries++ < 4 &&
+			    jbd2_journal_force_commit_nested(sbi->s_journal))
+				continue;
+			if (ret)
 				goto out;
-		} else { /* in_range(o_start, o_blk, o_len) */
-			cur_len += cur_blk - o_start;
+
+			*moved_len += m_len;
+			retries = 0;
 		}
-		unwritten = ext4_ext_is_unwritten(ex);
-		if (o_end - o_start < cur_len)
-			cur_len = o_end - o_start;
-
-		orig_page_index = o_start >> (PAGE_SHIFT -
-					      orig_inode->i_blkbits);
-		donor_page_index = d_start >> (PAGE_SHIFT -
-					       donor_inode->i_blkbits);
-		offset_in_page = o_start % blocks_per_page;
-		if (cur_len > blocks_per_page - offset_in_page)
-			cur_len = blocks_per_page - offset_in_page;
-		/*
-		 * Up semaphore to avoid following problems:
-		 * a. transaction deadlock among ext4_journal_start,
-		 *    ->write_begin via pagefault, and jbd2_journal_commit
-		 * b. racing with ->read_folio, ->write_begin, and
-		 *    ext4_get_block in move_extent_per_page
-		 */
-		ext4_double_up_write_data_sem(orig_inode, donor_inode);
-		/* Swap original branches with new branches */
-		*moved_len += move_extent_per_page(o_filp, donor_inode,
-					orig_page_index, donor_page_index,
-					offset_in_page, cur_len,
-					unwritten, &ret);
-		ext4_double_down_write_data_sem(orig_inode, donor_inode);
-		if (ret < 0)
-			break;
-		o_start += cur_len;
-		d_start += cur_len;
+		orig_blk += mext.orig_map.m_len;
+		donor_blk += mext.orig_map.m_len;
+		len -= mext.orig_map.m_len;
 	}
 
 out:
@@ -927,10 +629,6 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
 		ext4_discard_preallocations(donor_inode);
 	}
 
-	ext4_free_ext_path(path);
-	ext4_double_up_write_data_sem(orig_inode, donor_inode);
-unlock:
 	unlock_two_nondirectories(orig_inode, donor_inode);
-
 	return ret;
 }
-- 
2.46.1
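
A quick way to exercise the rewritten ext4_move_extents() path from userspace
is through the EXT4_IOC_MOVE_EXT ioctl, which is the interface e4defrag uses.
The sketch below is not part of this patch; the structure layout and ioctl
number mirror fs/ext4/ext4.h (they are not exported in a uapi header, so
userspace defines them locally), and the file names and block count are
illustrative only. The donor file is assumed to be a preallocated regular
file on the same ext4 filesystem.

/*
 * Minimal, hypothetical example: move up to 256 blocks of <orig file>
 * into the space backing <donor file> via EXT4_IOC_MOVE_EXT.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

struct move_extent {
	__u32 reserved;		/* must be zero */
	__u32 donor_fd;		/* donor file descriptor */
	__u64 orig_start;	/* logical start block of the original file */
	__u64 donor_start;	/* logical start block of the donor file */
	__u64 len;		/* number of blocks to move */
	__u64 moved_len;	/* out: blocks actually moved */
};
#define EXT4_IOC_MOVE_EXT	_IOWR('f', 15, struct move_extent)

int main(int argc, char **argv)
{
	struct move_extent me;
	int orig_fd, donor_fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <orig file> <donor file>\n", argv[0]);
		return 1;
	}

	/* Both files must be writable; the donor should already be
	 * preallocated (e.g. with fallocate()) on the same filesystem. */
	orig_fd = open(argv[1], O_RDWR);
	donor_fd = open(argv[2], O_RDWR);
	if (orig_fd < 0 || donor_fd < 0) {
		perror("open");
		return 1;
	}

	memset(&me, 0, sizeof(me));
	me.donor_fd = donor_fd;
	me.orig_start = 0;	/* start at logical block 0 of the original */
	me.donor_start = 0;	/* take blocks from the start of the donor */
	me.len = 256;		/* request 256 blocks; fewer may be moved */

	if (ioctl(orig_fd, EXT4_IOC_MOVE_EXT, &me) < 0) {
		perror("EXT4_IOC_MOVE_EXT");
		return 1;
	}

	printf("moved %llu blocks\n", (unsigned long long)me.moved_len);

	close(donor_fd);
	close(orig_fd);
	return 0;
}

On return, moved_len carries the same count that the new loop accumulates in
*moved_len; a result smaller than the requested length is normal, for example
when part of the range is a hole or a delalloc extent that gets skipped.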