From nobody Wed Oct 8 23:44:11 2025 Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1D449376; Mon, 23 Jun 2025 07:46:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664820; cv=none; b=EzQSmDZyHptLk3cqI0pv2Bq3sg0reLOKE5Gma+yVXltSg9hhRhlu5v+og2wbvylkiKnxN/74ov7++YwJDlVMt15ggbDINiEcBqqUHrQyF0/NgTjiCcz4GICmEBnJ10uNxLJdFT+ATI8IJqW60KDCZ0fih1qJlt+GCxTHn1JxVhI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664820; c=relaxed/simple; bh=wSNFclUCUsCTmMmziPFxFmR+/4LVux548FVUQDvdo48=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=RcfCu2JLanB7YV6Aod155YFjwCQTZkhh9JkQ7KZxcB2iEBo+x3DxB4TDMp8KV+0kDOEq8yKv/2p7GxXDAP31WGM3iQ0vQ+em7GY6FhPM9qyQ0pWCZZeRrb1cQ5BSG/l7kOngbIWQQdVq4lPAjCL1rVrFEVpxx09Jc24E+Xt0+so= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.194]) by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4bQg6504sqzTgs9; Mon, 23 Jun 2025 15:42:37 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 94AB514027D; Mon, 23 Jun 2025 15:46:55 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:46:54 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 01/16] ext4: add ext4_try_lock_group() to skip busy groups Date: Mon, 23 Jun 2025 15:32:49 +0800 Message-ID: <20250623073304.3275702-2-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" When ext4 allocates blocks, we used to just go through the block groups one by one to find a good one. But when there are tons of block groups (like hundreds of thousands or even millions) and not many have free space (meaning they're mostly full), it takes a really long time to check them all, and performance gets bad. So, we added the "mb_optimize_scan" mount option (which is on by default now). It keeps track of some group lists, so when we need a free block, we can just grab a likely group from the right list. This saves time and makes block allocation much faster. But when multiple processes or containers are doing similar things, like constantly allocating 8k blocks, they all try to use the same block group in the same list. Even just two processes doing this can cut the IOPS in half. 
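The core idea is easy to see in a stand-alone sketch: walk the candidate groups, take each group's lock only with a trylock, skip the group if another thread already holds it, and fall back to a blocking lock only on a final pass so free space is never missed. The snippet below is a user-space pthread analogue of that pattern, not the kernel code; the group structure, alloc_one_block() and the last_resort_pass flag are invented for illustration.

/* Skip-busy-groups sketch (user-space analogue, illustrative names).
 * Build with: cc -pthread demo.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NGROUPS 8

struct group {
	pthread_mutex_t lock;
	int free_blocks;
};

static struct group groups[NGROUPS];

/* Return the group we allocated from, or -1 if nothing was found. */
static int alloc_one_block(bool last_resort_pass)
{
	for (int i = 0; i < NGROUPS; i++) {
		struct group *g = &groups[i];

		if (last_resort_pass) {
			/* Final criterion: block so free space is never missed. */
			pthread_mutex_lock(&g->lock);
		} else if (pthread_mutex_trylock(&g->lock) != 0) {
			/* Busy group: someone else is allocating here, move on. */
			continue;
		}

		if (g->free_blocks > 0) {
			g->free_blocks--;
			pthread_mutex_unlock(&g->lock);
			return i;
		}
		pthread_mutex_unlock(&g->lock);
	}
	return -1;
}

int main(void)
{
	for (int i = 0; i < NGROUPS; i++) {
		pthread_mutex_init(&groups[i].lock, NULL);
		groups[i].free_blocks = 4;
	}
	printf("allocated from group %d\n", alloc_one_block(false));
	return 0;
}

Running several threads over alloc_one_block() is what makes the difference visible: with plain blocking locks they serialize on the first group, with the trylock they spread out. The cost of not skipping shows up directly in throughput.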
For example, a single container can reach about 300,000 IOPS, but running two containers concurrently yields only about 150,000 IOPS in total. Since block groups can already be traversed non-linearly, the first and last groups in the same list are effectively equivalent candidates for the current allocation. Therefore, add an ext4_try_lock_group() helper that skips the current group when its lock is held by another process, avoiding the contention and letting ext4 make better use of having multiple block groups. Also, so that groups with free space are not all skipped during allocation, busy groups are no longer skipped once ac_criteria reaches CR_ANY_FREE. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96 | Disk: 960GB SSD |-------------------------|-------------------------| | base | patched | base | patched | -------------------|-------|-----------------|-------|-----------------| mb_optimize_scan=3D0 | 2667 | 4821 (+80.7%) | 3450 | 15371 (+345%) | mb_optimize_scan=3D1 | 2643 | 4784 (+81.0%) | 3209 | 6101 (+90.0%) | Signed-off-by: Baokun Li Reviewed-by: Jan Kara --- fs/ext4/ext4.h | 23 ++++++++++++++--------- fs/ext4/mballoc.c | 19 ++++++++++++++++--- 2 files changed, 30 insertions(+), 12 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 18373de980f2..9df74123e7e6 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3541,23 +3541,28 @@ static inline int ext4_fs_is_busy(struct ext4_sb_in= fo *sbi) return (atomic_read(&sbi->s_lock_busy) > EXT4_CONTENTION_THRESHOLD); } =20 +static inline bool ext4_try_lock_group(struct super_block *sb, ext4_group_= t group) +{ + if (!spin_trylock(ext4_group_lock_ptr(sb, group))) + return false; + /* + * We're able to grab the lock right away, so drop the lock + * contention counter. + */ + atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0); + return true; +} + static inline void ext4_lock_group(struct super_block *sb, ext4_group_t gr= oup) { - spinlock_t *lock =3D ext4_group_lock_ptr(sb, group); - if (spin_trylock(lock)) - /* - * We're able to grab the lock right away, so drop the - * lock contention counter. - */ - atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, -1, 0); - else { + if (!ext4_try_lock_group(sb, group)) { /* * The lock is busy, so bump the contention counter, * and then wait on the spin lock.
*/ atomic_add_unless(&EXT4_SB(sb)->s_lock_busy, 1, EXT4_MAX_CONTENTION); - spin_lock(lock); + spin_lock(ext4_group_lock_ptr(sb, group)); } } =20 diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 1e98c5be4e0a..336d65c4f6a2 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -896,7 +896,8 @@ static void ext4_mb_choose_next_group_p2_aligned(struct= ext4_allocation_context bb_largest_free_order_node) { if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]); - if (likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) { + if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && + likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) { *group =3D iter->bb_group; ac->ac_flags |=3D EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); @@ -932,7 +933,8 @@ ext4_mb_find_good_group_avg_frag_lists(struct ext4_allo= cation_context *ac, int o list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) { if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); - if (likely(ext4_mb_good_group(ac, iter->bb_group, cr))) { + if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && + likely(ext4_mb_good_group(ac, iter->bb_group, cr))) { grp =3D iter; break; } @@ -2899,6 +2901,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_con= text *ac) nr, &prefetch_ios); } =20 + /* prevent unnecessary buddy loading. */ + if (cr < CR_ANY_FREE && + spin_is_locked(ext4_group_lock_ptr(sb, group))) + continue; + /* This now checks without needing the buddy page */ ret =3D ext4_mb_good_group_nolock(ac, group, cr); if (ret <=3D 0) { @@ -2911,7 +2918,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_con= text *ac) if (err) goto out; =20 - ext4_lock_group(sb, group); + /* skip busy group */ + if (cr >=3D CR_ANY_FREE) { + ext4_lock_group(sb, group); + } else if (!ext4_try_lock_group(sb, group)) { + ext4_mb_unload_buddy(&e4b); + continue; + } =20 /* * We need to check again after locking the --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga06-in.huawei.com (szxga06-in.huawei.com [45.249.212.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D575872625; Mon, 23 Jun 2025 07:46:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.32 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664821; cv=none; b=BagN769DEnCpL9fg1lb3VVdY2Xv6T5InRj+G9AS0torZpQEpAvy/NbOtsNDK/VbHXVBbaefF6M+wvxxeJxw73MHF/iGZ1iofUmiEGuhbejoOzKug9sS+p3uzkHpzeF5RF4AL924cMA0d6fvSDQojCQFKvaoVYch9WwBnO4PmsCM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664821; c=relaxed/simple; bh=Dn8aKeCp2MqLWglq2wjerggE0C1t22uLrSme0DTgoeE=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=dG74/hzxuYMe3+2QGNYEiCFzSQWsGsU6laANvzDtwZQJDDDCft672EPI+kUfrdjCjV7G8xF1X+1C+9E5kjbueg90ij7mx/iO+1uA+Hqe/U3YHUYSLV+leoBVOi/IwJTnlTNTFa0z8ZfXvpz22ytBwZlRdGW9hS0Z9TQl0dyQNn4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.32 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: 
from mail.maildlp.com (unknown [172.19.88.163]) by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4bQgD717Ssz2QVJ9; Mon, 23 Jun 2025 15:47:51 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 6228418005F; Mon, 23 Jun 2025 15:46:56 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:46:55 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 02/16] ext4: remove unnecessary s_mb_last_start Date: Mon, 23 Jun 2025 15:32:50 +0800 Message-ID: <20250623073304.3275702-3-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" ac->ac_g_ex.fe_start is only used in ext4_mb_find_by_goal(), but STREAM ALLOC is activated after ext4_mb_find_by_goal() fails, so there's no need to update ac->ac_g_ex.fe_start, remove the unnecessary s_mb_last_start. Signed-off-by: Baokun Li Reviewed-by: Jan Kara --- fs/ext4/ext4.h | 1 - fs/ext4/mballoc.c | 2 -- 2 files changed, 3 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 9df74123e7e6..cfb60f8fbb63 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1631,7 +1631,6 @@ struct ext4_sb_info { unsigned int s_max_dir_size_kb; /* where last allocation was done - for stream allocation */ unsigned long s_mb_last_group; - unsigned long s_mb_last_start; unsigned int s_mb_prefetch; unsigned int s_mb_prefetch_limit; unsigned int s_mb_best_avail_max_trim_order; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 336d65c4f6a2..5cdae3bda072 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2171,7 +2171,6 @@ static void ext4_mb_use_best_found(struct ext4_alloca= tion_context *ac, if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { spin_lock(&sbi->s_md_lock); sbi->s_mb_last_group =3D ac->ac_f_ex.fe_group; - sbi->s_mb_last_start =3D ac->ac_f_ex.fe_start; spin_unlock(&sbi->s_md_lock); } /* @@ -2849,7 +2848,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) /* TBD: may be hot point */ spin_lock(&sbi->s_md_lock); ac->ac_g_ex.fe_group =3D sbi->s_mb_last_group; - ac->ac_g_ex.fe_start =3D sbi->s_mb_last_start; spin_unlock(&sbi->s_md_lock); } =20 --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga05-in.huawei.com (szxga05-in.huawei.com [45.249.212.191]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B52581388; Mon, 23 Jun 2025 07:47:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.191 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664823; cv=none; b=mc4enlcAOb+wgeaFIj3kSjSIGyVoXLgI9fRo2EcwJWBdYdQ1FTQUKe8D5qxvF2FSsBDeVhYqYhXM3/5vGCAsPbCsOebStIxrZOm5VfrFRDLwO5ALqX0/iA6TP7QZto19/WJKhq0/tKDy/H2ySk3oNeujyLUYL+kBHLfPC4hNy1s= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664823; c=relaxed/simple; 
bh=jsK+9dugOTCQLSNIxqMiLiZ14gIE22FtPTOf9JWnmqM=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=EE7A2f/+bGO2BfbOVQx99wU8kV98UKoCOsRNr5rFpAdGt/2oYdgXvb5lPOJNff/JgmxhvLHJgf6ePG77Wk5EmVFvHMBhtL6PvsAQxTfCKvQkH4MHRiP9HUrgMQO0g5MMJj21q8aygczFUrDtgIM4hB/8QVQebfXbcR+RUMHgk64= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.191 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.17]) by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4bQg9H0Rcvz2BdVh; Mon, 23 Jun 2025 15:45:23 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 3AA441A0188; Mon, 23 Jun 2025 15:46:57 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:46:56 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 03/16] ext4: remove unnecessary s_md_lock on update s_mb_last_group Date: Mon, 23 Jun 2025 15:32:51 +0800 Message-ID: <20250623073304.3275702-4-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" After we optimized the block group lock, we found another lock contention issue when running will-it-scale/fallocate2 with multiple processes. The fallocate's block allocation and the truncate's block release were fighting over the s_md_lock. The problem is, this lock protects totally different things in those two processes: the list of freed data blocks (s_freed_data_list) when releasing, and where to start looking for new blocks (mb_last_group) when allocating. Now we only need to track s_mb_last_group and no longer need to track s_mb_last_start, so we don't need the s_md_lock lock to ensure that the two are consistent, and we can ensure that the s_mb_last_group read is up to date by using smp_store_release/smp_load_acquire. Besides, the s_mb_last_group data type only requires ext4_group_t (i.e., unsigned int), rendering unsigned long superfluous. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. 
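To make the memory-ordering claim concrete, here is a minimal user-space C11 analogue of the pattern: the writer publishes the last group it allocated from with a release store, and the reader picks the hint up with an acquire load, so no lock is needed to keep a single word fresh. The names are invented for illustration; in the kernel the same roles are played by smp_store_release() and smp_load_acquire() on sbi->s_mb_last_group.

/* Release/acquire hint sketch (user-space C11, illustrative names). */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned int last_group;

/* Writer side: remember where the last stream allocation succeeded. */
static void remember_goal(unsigned int group)
{
	atomic_store_explicit(&last_group, group, memory_order_release);
}

/* Reader side: start the next scan from the remembered group. */
static unsigned int read_goal(void)
{
	return atomic_load_explicit(&last_group, memory_order_acquire);
}

int main(void)
{
	remember_goal(42);
	printf("next scan starts at group %u\n", read_goal());
	return 0;
}

With s_md_lock gone from this path, the numbers are: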
| Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96 | Disk: 960GB SSD |-------------------------|-------------------------| | base | patched | base | patched | -------------------|-------|-----------------|-------|-----------------| mb_optimize_scan=3D0 | 4821 | 7612 (+57.8%) | 15371 | 21647 (+40.8%) | mb_optimize_scan=3D1 | 4784 | 7568 (+58.1%) | 6101 | 9117 (+49.4%) | Signed-off-by: Baokun Li --- fs/ext4/ext4.h | 2 +- fs/ext4/mballoc.c | 17 ++++++----------- 2 files changed, 7 insertions(+), 12 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index cfb60f8fbb63..93f03d8c3dca 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1630,7 +1630,7 @@ struct ext4_sb_info { unsigned int s_mb_group_prealloc; unsigned int s_max_dir_size_kb; /* where last allocation was done - for stream allocation */ - unsigned long s_mb_last_group; + ext4_group_t s_mb_last_group; unsigned int s_mb_prefetch; unsigned int s_mb_prefetch_limit; unsigned int s_mb_best_avail_max_trim_order; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 5cdae3bda072..3f103919868b 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2168,11 +2168,9 @@ static void ext4_mb_use_best_found(struct ext4_alloc= ation_context *ac, ac->ac_buddy_folio =3D e4b->bd_buddy_folio; folio_get(ac->ac_buddy_folio); /* store last allocated for subsequent stream allocation */ - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - spin_lock(&sbi->s_md_lock); - sbi->s_mb_last_group =3D ac->ac_f_ex.fe_group; - spin_unlock(&sbi->s_md_lock); - } + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) + /* pairs with smp_load_acquire in ext4_mb_regular_allocator() */ + smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group); /* * As we've just preallocated more space than * user requested originally, we store allocated @@ -2844,12 +2842,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_con= text *ac) } =20 /* if stream allocation is enabled, use global goal */ - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { - /* TBD: may be hot point */ - spin_lock(&sbi->s_md_lock); - ac->ac_g_ex.fe_group =3D sbi->s_mb_last_group; - spin_unlock(&sbi->s_md_lock); - } + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) + /* pairs with smp_store_release in ext4_mb_use_best_found() */ + ac->ac_g_ex.fe_group =3D smp_load_acquire(&sbi->s_mb_last_group); =20 /* * Let's just scan groups to find more-less suitable blocks We --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DFC241F6667; Mon, 23 Jun 2025 07:47:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.190 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664823; cv=none; b=kq/2A9mQHKjbZLP1llrfxru6VATpX8POQMAFywdUtndcuTxA9D8RyswZO4ppjdDpFyY5QqmGqx7LICNfLQhlwddG2zbBd1Bvg9zCBCE8U4+c56P8j4f1HE/FqOg13RDzO3IDWITEIquXhwrKny6pJg1BRzJ2sCh8k0eaRgI+7Vo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664823; c=relaxed/simple; bh=sYtqUFt4brvDA+yF4dcv6k3g1PaR0DyYG66HfGgJ7PQ=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=UuW4p3KOhL2MRSlJzjrTcz8K4IDeJM/aYz5AAAQXfuRcRaIh6vjD4j/scx+d7QG7fJa6G5aFTsrL9/Y2bfkI59UYMu5+jHZrO/6UwFMtHIynf9a1LZHu/X1v/N7ud2wi5sBSeb91+WNnaVK1fHptKzoDoTT87JGeIwUDVnmZTcs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass 
(p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.190 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.162.112]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4bQg9H39fFz2TSJG; Mon, 23 Jun 2025 15:45:23 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 1A0501401F2; Mon, 23 Jun 2025 15:46:58 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:46:57 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention Date: Mon, 23 Jun 2025 15:32:52 +0800 Message-ID: <20250623073304.3275702-5-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) When allocating data blocks, if the first try (goal allocation) fails and stream allocation is on, it tries a global goal starting from the last group we used (s_mb_last_group). This helps cluster large files together to reduce free space fragmentation, and the data block contiguity also accelerates write-back to disk. However, when multiple processes allocate blocks,=C2=A0having just one glob= al goal means they all fight over the same group. This drastically lowers the chances of extents merging and leads to much worse file fragmentation. To mitigate this multi-process contention, we now employ multiple global goals, with the number of goals being the CPU count rounded up to the nearest power of 2. To ensure a consistent goal for each inode, we select the corresponding goal by taking the inode number modulo the total number of goals. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. 
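How an inode picks its goal can be shown in a few lines of user-space C. The sketch assumes a 12-CPU machine purely for illustration, and roundup_pow_of_two() here is a local stand-in for the kernel helper of the same name; the point is only that the slot count is a power of two and that a given inode always lands in the same slot.

/* Goal-slot selection sketch (user-space, illustrative). */
#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned int nr_cpus = 12;                            /* assumed CPU count */
	unsigned int nr_goals = roundup_pow_of_two(nr_cpus);  /* 16 slots */
	unsigned long ino = 123456;

	/* Same inode -> same slot; different inodes spread across slots. */
	printf("inode %lu uses goal slot %lu of %u\n",
	       ino, ino % nr_goals, nr_goals);
	return 0;
}

Each slot is then read and written with the same acquire/release pairing as before, just indexed by the slot. With the goals spread out, the numbers are: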
| Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96 | Disk: 960GB SSD |-------------------------|-------------------------| | base | patched | base | patched | -------------------|-------|-----------------|-------|-----------------| mb_optimize_scan=3D0 | 7612 | 19699 (+158%) | 21647 | 53093 (+145%) | mb_optimize_scan=3D1 | 7568 | 9862 (+30.3%) | 9117 | 14401 (+57.9%) | Signed-off-by: Baokun Li --- fs/ext4/ext4.h | 2 +- fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++------- fs/ext4/mballoc.h | 9 +++++++++ 3 files changed, 34 insertions(+), 8 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 93f03d8c3dca..c3f16aba7b79 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1630,7 +1630,7 @@ struct ext4_sb_info { unsigned int s_mb_group_prealloc; unsigned int s_max_dir_size_kb; /* where last allocation was done - for stream allocation */ - ext4_group_t s_mb_last_group; + ext4_group_t *s_mb_last_groups; unsigned int s_mb_prefetch; unsigned int s_mb_prefetch_limit; unsigned int s_mb_best_avail_max_trim_order; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 3f103919868b..216b332a5054 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2168,9 +2168,12 @@ static void ext4_mb_use_best_found(struct ext4_alloc= ation_context *ac, ac->ac_buddy_folio =3D e4b->bd_buddy_folio; folio_get(ac->ac_buddy_folio); /* store last allocated for subsequent stream allocation */ - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) - /* pairs with smp_load_acquire in ext4_mb_regular_allocator() */ - smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group); + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { + int hash =3D ac->ac_inode->i_ino % MB_LAST_GROUPS; + /* Pairs with smp_load_acquire in ext4_mb_regular_allocator() */ + smp_store_release(&sbi->s_mb_last_groups[hash], + ac->ac_f_ex.fe_group); + } /* * As we've just preallocated more space than * user requested originally, we store allocated @@ -2842,9 +2845,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_con= text *ac) } =20 /* if stream allocation is enabled, use global goal */ - if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) - /* pairs with smp_store_release in ext4_mb_use_best_found() */ - ac->ac_g_ex.fe_group =3D smp_load_acquire(&sbi->s_mb_last_group); + if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) { + int hash =3D ac->ac_inode->i_ino % MB_LAST_GROUPS; + /* Pairs with smp_store_release in ext4_mb_use_best_found() */ + ac->ac_g_ex.fe_group =3D smp_load_acquire( + &sbi->s_mb_last_groups[hash]); + } =20 /* * Let's just scan groups to find more-less suitable blocks We @@ -3715,10 +3721,17 @@ int ext4_mb_init(struct super_block *sb) sbi->s_mb_group_prealloc, EXT4_NUM_B2C(sbi, sbi->s_stripe)); } =20 + sbi->s_mb_last_groups =3D kcalloc(MB_LAST_GROUPS, sizeof(ext4_group_t), + GFP_KERNEL); + if (sbi->s_mb_last_groups =3D=3D NULL) { + ret =3D -ENOMEM; + goto out; + } + sbi->s_locality_groups =3D alloc_percpu(struct ext4_locality_group); if (sbi->s_locality_groups =3D=3D NULL) { ret =3D -ENOMEM; - goto out; + goto out_free_last_groups; } for_each_possible_cpu(i) { struct ext4_locality_group *lg; @@ -3743,6 +3756,9 @@ int ext4_mb_init(struct super_block *sb) out_free_locality_groups: free_percpu(sbi->s_locality_groups); sbi->s_locality_groups =3D NULL; +out_free_last_groups: + kvfree(sbi->s_mb_last_groups); + sbi->s_mb_last_groups =3D NULL; out: kfree(sbi->s_mb_avg_fragment_size); kfree(sbi->s_mb_avg_fragment_size_locks); @@ -3847,6 +3863,7 @@ void ext4_mb_release(struct super_block *sb) } =20 free_percpu(sbi->s_locality_groups); + 
kvfree(sbi->s_mb_last_groups); } =20 static inline int ext4_issue_discard(struct super_block *sb, diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index f8280de3e882..38c37901728d 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -97,6 +97,15 @@ */ #define MB_NUM_ORDERS(sb) ((sb)->s_blocksize_bits + 2) =20 +/* + * Number of mb last groups + */ +#ifdef CONFIG_SMP +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids) +#else +#define MB_LAST_GROUPS 1 +#endif + struct ext4_free_data { /* this links the free block information from sb_info */ struct list_head efd_list; --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [45.249.212.189]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 36C3D1FCFE7; Mon, 23 Jun 2025 07:47:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.189 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664824; cv=none; b=DPgWzW3G1fXoa/BK2s30wCODCVJEL/p2VVMPZgv76DQXqTfiVxgUk2k2BEA3Dmhj2olZZZVeKffbMg3dyR+VeUN/cwZwBrJ3bCWzCU4tEmNEWHziQpc+MkYLUBM7DZSQVIooJ56HPeYOZzKuVcFX30Fp1VE6rTgp6b2Wu7hqrCY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664824; c=relaxed/simple; bh=LPQSuE8BTwF8bSIOyxlSqSDoaQ7gvIZGhzPOCn2C3E4=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=FFbEbpdWeb4F8vBusHoCQJXaNbQB5c0N2x+GWqHY9XAxoUAIkMGs/ONelSjyShLwJQuMoXJSCl9ioSPiXicZrUsNIh61g5uzw/h2L1BxwFPspAl1NsK2l7NEWQW1GIXqYXUuGHr1btLY8F+Pc+MxQjny4gMH+gWxTR/M5g9whXY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.189 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.105]) by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4bQg6Z0JFWzPt6h; Mon, 23 Jun 2025 15:43:02 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id DFCD81400DC; Mon, 23 Jun 2025 15:46:58 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:46:58 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 05/16] ext4: get rid of some obsolete EXT4_MB_HINT flags Date: Mon, 23 Jun 2025 15:32:53 +0800 Message-ID: <20250623073304.3275702-6-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Since nobody has used these EXT4_MB_HINT flags for ages, let's remove them. 
Signed-off-by: Baokun Li Reviewed-by: Ojaswin Mujoo Reviewed-by: Jan Kara --- fs/ext4/ext4.h | 6 ------ include/trace/events/ext4.h | 3 --- 2 files changed, 9 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index c3f16aba7b79..29b3817f41a5 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -185,14 +185,8 @@ enum criteria { =20 /* prefer goal again. length */ #define EXT4_MB_HINT_MERGE 0x0001 -/* blocks already reserved */ -#define EXT4_MB_HINT_RESERVED 0x0002 -/* metadata is being allocated */ -#define EXT4_MB_HINT_METADATA 0x0004 /* first blocks in the file */ #define EXT4_MB_HINT_FIRST 0x0008 -/* search for the best chunk */ -#define EXT4_MB_HINT_BEST 0x0010 /* data is being allocated */ #define EXT4_MB_HINT_DATA 0x0020 /* don't preallocate (for tails) */ diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h index 156908641e68..33b204165cc0 100644 --- a/include/trace/events/ext4.h +++ b/include/trace/events/ext4.h @@ -23,10 +23,7 @@ struct partial_cluster; =20 #define show_mballoc_flags(flags) __print_flags(flags, "|", \ { EXT4_MB_HINT_MERGE, "HINT_MERGE" }, \ - { EXT4_MB_HINT_RESERVED, "HINT_RESV" }, \ - { EXT4_MB_HINT_METADATA, "HINT_MDATA" }, \ { EXT4_MB_HINT_FIRST, "HINT_FIRST" }, \ - { EXT4_MB_HINT_BEST, "HINT_BEST" }, \ { EXT4_MB_HINT_DATA, "HINT_DATA" }, \ { EXT4_MB_HINT_NOPREALLOC, "HINT_NOPREALLOC" }, \ { EXT4_MB_HINT_GROUP_ALLOC, "HINT_GRP_ALLOC" }, \ --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0D6262036FE; Mon, 23 Jun 2025 07:47:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.187 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664824; cv=none; b=LqcIqLaO6qoW71gmehjb7W70I1KmwhTwezH2hCchoROnW1Fea2+A4OGzRgJe11TD+ZRP3gcMB+NsugNHiVFgxoS3fkoBgOhb/iUmeyXb28cad3NfvAWdrjHPqUbXmXrR0Lgc6ikH2UXVhnfjmM7OHpGJ0FUl70iY3TfX/818gRU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664824; c=relaxed/simple; bh=MJXHq6WyflIQjmrjhMQd4W5AJVA/rdWGX7YSAixl+MI=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=WnHydSt7GoIIUIeZk2srwDpRVu01f45Bi7MCCzy4tGgPo9D5erq6b8HQhoG+YRYg6n7ojbnJhi01poX/QslO2l+lHeCUPYcRh6m8CIFVnt50i+3yJmKpSA2LYf2Kk5uNTQYGF/MCMmhDzuKgAzSFxxM9M2ZMVDCZPXFvcLwzD7U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.187 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.105]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4bQg5q2ghTz10XKG; Mon, 23 Jun 2025 15:42:23 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id C0EA31400DC; Mon, 23 Jun 2025 15:46:59 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:46:58 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 06/16] ext4: fix typo 
in CR_GOAL_LEN_SLOW comment Date: Mon, 23 Jun 2025 15:32:54 +0800 Message-ID: <20250623073304.3275702-7-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Remove the superfluous "find_". Signed-off-by: Baokun Li Reviewed-by: Ojaswin Mujoo Reviewed-by: Jan Kara --- fs/ext4/ext4.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 29b3817f41a5..294198c05cdd 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -157,7 +157,7 @@ enum criteria { =20 /* * Reads each block group sequentially, performing disk IO if - * necessary, to find find_suitable block group. Tries to + * necessary, to find suitable block group. Tries to * allocate goal length but might trim the request if nothing * is found after enough tries. */ --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1B3721F873B; Mon, 23 Jun 2025 07:47:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.190 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664825; cv=none; b=JNcENpB/nFbK3fQMEEBdWWpwpooheHjItegOTZPXPdBeuq5piihgiSW5ScWN7e0kG5HtddMNd/HrgEQr32QjfOTEgfY2ip2KbCFv1MDg2sfXAw/uU2GO0eL3tyJUyZ6rz6fFAyvleb9/RuXYwbtlbNuL5yVN+lJqlP3qB116PNo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664825; c=relaxed/simple; bh=gJ2vImNIjZMqWgvCr44KX88669L0OzhjtbTYuJvlBVQ=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=UV+okVkGPDyvcbSmJCsXn1nNrOJeHbN2mgOYUiQPm4Oo6S5qAHyps8B6JPYGlYq0aCDXEdVH+1GadhV38gVpRYWnlaDnvzUh6b+j2E1N9muNisIZEIVHDANAS6Bfr0GGAfqbrInk+V8uqyQkdY/IX/7/D/2DQA9VS7apFfkDPYk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.190 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.17]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4bQg6b72mTz2Cfbr; Mon, 23 Jun 2025 15:43:03 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 966E61A0188; Mon, 23 Jun 2025 15:47:00 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:46:59 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 07/16] ext4: convert sbi->s_mb_free_pending to atomic_t Date: Mon, 23 Jun 2025 15:32:55 +0800 Message-ID: <20250623073304.3275702-8-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: 
<20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Previously, s_md_lock was used to protect s_mb_free_pending during modifications, while smp_mb() ensured fresh reads, so s_md_lock just guarantees the atomicity of s_mb_free_pending. Thus we optimized it by converting s_mb_free_pending into an atomic variable, thereby eliminating s_md_lock and minimizing lock contention. This also prepares for future lockless merging of free extents. Following this modification, s_md_lock is exclusively responsible for managing insertions and deletions within s_freed_data_list, along with operations involving list_splice. Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96 | Disk: 960GB SSD |-------------------------|-------------------------| | base | patched | base | patched | Reviewed-by: Jan Kara -------------------|-------|-----------------|-------|-----------------| mb_optimize_scan=3D0 | 19699 | 20982 (+6.5%) | 53093 | 50629 (-4.6%) | mb_optimize_scan=3D1 | 9862 | 10703 (+8.5%) | 14401 | 14856 (+3.1%) | Signed-off-by: Baokun Li --- fs/ext4/balloc.c | 2 +- fs/ext4/ext4.h | 2 +- fs/ext4/mballoc.c | 9 +++------ 3 files changed, 5 insertions(+), 8 deletions(-) diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c index c48fd36b2d74..c9329ed5c094 100644 --- a/fs/ext4/balloc.c +++ b/fs/ext4/balloc.c @@ -703,7 +703,7 @@ int ext4_should_retry_alloc(struct super_block *sb, int= *retries) * possible we just missed a transaction commit that did so */ smp_mb(); - if (sbi->s_mb_free_pending =3D=3D 0) { + if (atomic_read(&sbi->s_mb_free_pending) =3D=3D 0) { if (test_opt(sb, DISCARD)) { atomic_inc(&sbi->s_retry_alloc_pending); flush_work(&sbi->s_discard_work); diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 294198c05cdd..003b8d3726e8 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1602,7 +1602,7 @@ struct ext4_sb_info { unsigned short *s_mb_offsets; unsigned int *s_mb_maxs; unsigned int s_group_info_size; - unsigned int s_mb_free_pending; + atomic_t s_mb_free_pending; struct list_head s_freed_data_list[2]; /* List of blocks to be freed after commit completed */ struct list_head s_discard_list; diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 216b332a5054..5410fb3688ee 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -3680,7 +3680,7 @@ int ext4_mb_init(struct super_block *sb) } =20 spin_lock_init(&sbi->s_md_lock); - sbi->s_mb_free_pending =3D 0; + atomic_set(&sbi->s_mb_free_pending, 0); INIT_LIST_HEAD(&sbi->s_freed_data_list[0]); INIT_LIST_HEAD(&sbi->s_freed_data_list[1]); INIT_LIST_HEAD(&sbi->s_discard_list); @@ -3894,10 +3894,7 @@ static void ext4_free_data_in_buddy(struct super_blo= ck *sb, /* we expect to find existing buddy because it's pinned */ BUG_ON(err !=3D 0); =20 - spin_lock(&EXT4_SB(sb)->s_md_lock); - EXT4_SB(sb)->s_mb_free_pending -=3D entry->efd_count; - spin_unlock(&EXT4_SB(sb)->s_md_lock); - + atomic_sub(entry->efd_count, &EXT4_SB(sb)->s_mb_free_pending); db =3D e4b.bd_info; /* there are blocks to put in buddy to make them really free */ count 
+=3D entry->efd_count; @@ -6392,7 +6389,7 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_b= uddy *e4b, =20 spin_lock(&sbi->s_md_lock); list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->ef= d_tid & 1]); - sbi->s_mb_free_pending +=3D clusters; + atomic_add(clusters, &sbi->s_mb_free_pending); spin_unlock(&sbi->s_md_lock); } =20 --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga05-in.huawei.com (szxga05-in.huawei.com [45.249.212.191]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CDDBD21ADC5; Mon, 23 Jun 2025 07:47:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.191 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664826; cv=none; b=qk9YdaK2Kz8+RGcBskHht/QqcXTEMpokNqh5k02OH0mSoi5Njo1ySd+b6Im5NhrcfNN/t2Q6PC4CnXtI7rjlh/3LcG9SRXMKZkFdKAH+wsvUyER1ZH7Xsn2ylxWFR8WC4YBP3kGuZ/EFPVd3U8twPA69DR2YcnFIrsl5S98sewk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664826; c=relaxed/simple; bh=rKiQucLOQupOFyRbv7D8fr/aZluRofSEBdGCBVJ84Cs=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=TJBmDZOOO6U8cEY4c7vq2bxM3YyFZIs97vNWarMoMPg/4gMSilfgnFIqcVHafdEfOktH5iBQGajiY4shbL9GbxNky3gFdmm9/k0GyikjtC43AT/eyq+ckfvMnc6x+mkb8SereHFMIp9qEKH3ij83Tq4cH8774eqz0VqP6cj/PKM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.191 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.162.112]) by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4bQg8L3Nb3z28fRg; Mon, 23 Jun 2025 15:44:34 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 6EA941401F2; Mon, 23 Jun 2025 15:47:01 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:00 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 08/16] ext4: merge freed extent with existing extents before insertion Date: Mon, 23 Jun 2025 15:32:56 +0800 Message-ID: <20250623073304.3275702-9-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Attempt to merge ext4_free_data with already inserted free extents prior to adding new ones. This strategy drastically cuts down the number of times locks are held. 
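The merge rule itself is simple, and a toy user-space model captures it: a newly freed extent is folded into an adjacent pending extent whenever one ends exactly where the other begins, so only the already-linked record grows and no separate insertion into the shared list is needed. This is an illustrative model only; the real code additionally requires both extents to come from the same transaction and the same block group, and keeps the records in a per-group rbtree rather than a bare trio of variables.

/* Merge-before-insert sketch (user-space, illustrative names). */
#include <stdbool.h>
#include <stdio.h>

struct extent {
	unsigned int start;
	unsigned int count;
};

static bool mergeable(const struct extent *a, const struct extent *b)
{
	return a->start + a->count == b->start;	/* a ends where b begins */
}

/*
 * Try to fold "new" into "prev" or "next"; return true if no separate
 * insertion (and hence no extra pass under the shared lock) is needed.
 */
static bool try_merge(struct extent *prev, struct extent *next,
		      struct extent *new)
{
	if (prev && mergeable(prev, new)) {
		prev->count += new->count;
		if (next && mergeable(prev, next)) {	/* "new" bridged the gap */
			prev->count += next->count;
			next->count = 0;	/* next is consumed */
		}
		return true;
	}
	if (next && mergeable(new, next)) {
		next->start = new->start;
		next->count += new->count;
		return true;
	}
	return false;
}

int main(void)
{
	struct extent prev = { .start = 0,  .count = 8 };
	struct extent next = { .start = 16, .count = 8 };
	struct extent new  = { .start = 8,  .count = 8 };

	printf("merged without insertion: %s\n",
	       try_merge(&prev, &next, &new) ? "yes" : "no");
	printf("prev now covers %u clusters\n", prev.count);
	return 0;
}

The commit message's own walk-through shows how many lock acquisitions this saves: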
For example, if prev, new, and next extents are all mergeable, the existing code (before this patch) requires acquiring the s_md_lock three times: prev merge into new and free prev // hold lock next merge into new and free next // hold lock insert new // hold lock After the patch, it only needs to be acquired once: new merge next and free new // no lock next merge into prev and free prev // hold lock Performance test data follows: Test: Running will-it-scale/fallocate2 on CPU-bound containers. Observation: Average fallocate operations per container per second. | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96 | Disk: 960GB SSD |-------------------------|-------------------------| | base | patched | base | patched | Reviewed-by: Jan Kara -------------------|-------|-----------------|-------|-----------------| mb_optimize_scan=3D0 | 20982 | 21157 (+0.8%) | 50629 | 50420 (-0.4%) | mb_optimize_scan=3D1 | 10703 | 12896 (+20.4%) | 14856 | 17273 (+16.2%) | Signed-off-by: Baokun Li --- fs/ext4/mballoc.c | 113 +++++++++++++++++++++++++++++++--------------- 1 file changed, 76 insertions(+), 37 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 5410fb3688ee..94950b07a577 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -6298,28 +6298,63 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle, * are contiguous, AND the extents were freed by the same transaction, * AND the blocks are associated with the same group. */ -static void ext4_try_merge_freed_extent(struct ext4_sb_info *sbi, - struct ext4_free_data *entry, - struct ext4_free_data *new_entry, - struct rb_root *entry_rb_root) +static inline bool +ext4_freed_extents_can_be_merged(struct ext4_free_data *entry1, + struct ext4_free_data *entry2) { - if ((entry->efd_tid !=3D new_entry->efd_tid) || - (entry->efd_group !=3D new_entry->efd_group)) - return; - if (entry->efd_start_cluster + entry->efd_count =3D=3D - new_entry->efd_start_cluster) { - new_entry->efd_start_cluster =3D entry->efd_start_cluster; - new_entry->efd_count +=3D entry->efd_count; - } else if (new_entry->efd_start_cluster + new_entry->efd_count =3D=3D - entry->efd_start_cluster) { - new_entry->efd_count +=3D entry->efd_count; - } else - return; + if (entry1->efd_tid !=3D entry2->efd_tid) + return false; + if (entry1->efd_start_cluster + entry1->efd_count !=3D + entry2->efd_start_cluster) + return false; + if (WARN_ON_ONCE(entry1->efd_group !=3D entry2->efd_group)) + return false; + return true; +} + +static inline void +ext4_merge_freed_extents(struct ext4_sb_info *sbi, struct rb_root *root, + struct ext4_free_data *entry1, + struct ext4_free_data *entry2) +{ + entry1->efd_count +=3D entry2->efd_count; spin_lock(&sbi->s_md_lock); - list_del(&entry->efd_list); + list_del(&entry2->efd_list); spin_unlock(&sbi->s_md_lock); - rb_erase(&entry->efd_node, entry_rb_root); - kmem_cache_free(ext4_free_data_cachep, entry); + rb_erase(&entry2->efd_node, root); + kmem_cache_free(ext4_free_data_cachep, entry2); +} + +static inline void +ext4_try_merge_freed_extent_prev(struct ext4_sb_info *sbi, struct rb_root = *root, + struct ext4_free_data *entry) +{ + struct ext4_free_data *prev; + struct rb_node *node; + + node =3D rb_prev(&entry->efd_node); + if (!node) + return; + + prev =3D rb_entry(node, struct ext4_free_data, efd_node); + if (ext4_freed_extents_can_be_merged(prev, entry)) + ext4_merge_freed_extents(sbi, root, prev, entry); +} + +static inline void +ext4_try_merge_freed_extent_next(struct ext4_sb_info *sbi, struct rb_root = *root, + struct ext4_free_data *entry) +{ 
+ struct ext4_free_data *next; + struct rb_node *node; + + node =3D rb_next(&entry->efd_node); + if (!node) + return; + + next =3D rb_entry(node, struct ext4_free_data, efd_node); + if (ext4_freed_extents_can_be_merged(entry, next)) + ext4_merge_freed_extents(sbi, root, entry, next); } =20 static noinline_for_stack void @@ -6329,11 +6364,12 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4= _buddy *e4b, ext4_group_t group =3D e4b->bd_group; ext4_grpblk_t cluster; ext4_grpblk_t clusters =3D new_entry->efd_count; - struct ext4_free_data *entry; + struct ext4_free_data *entry =3D NULL; struct ext4_group_info *db =3D e4b->bd_info; struct super_block *sb =3D e4b->bd_sb; struct ext4_sb_info *sbi =3D EXT4_SB(sb); - struct rb_node **n =3D &db->bb_free_root.rb_node, *node; + struct rb_root *root =3D &db->bb_free_root; + struct rb_node **n =3D &root->rb_node; struct rb_node *parent =3D NULL, *new_node; =20 BUG_ON(!ext4_handle_valid(handle)); @@ -6369,27 +6405,30 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4= _buddy *e4b, } } =20 - rb_link_node(new_node, parent, n); - rb_insert_color(new_node, &db->bb_free_root); - - /* Now try to see the extent can be merged to left and right */ - node =3D rb_prev(new_node); - if (node) { - entry =3D rb_entry(node, struct ext4_free_data, efd_node); - ext4_try_merge_freed_extent(sbi, entry, new_entry, - &(db->bb_free_root)); + atomic_add(clusters, &sbi->s_mb_free_pending); + if (!entry) + goto insert; + + /* Now try to see the extent can be merged to prev and next */ + if (ext4_freed_extents_can_be_merged(new_entry, entry)) { + entry->efd_start_cluster =3D cluster; + entry->efd_count +=3D new_entry->efd_count; + kmem_cache_free(ext4_free_data_cachep, new_entry); + ext4_try_merge_freed_extent_prev(sbi, root, entry); + return; } - - node =3D rb_next(new_node); - if (node) { - entry =3D rb_entry(node, struct ext4_free_data, efd_node); - ext4_try_merge_freed_extent(sbi, entry, new_entry, - &(db->bb_free_root)); + if (ext4_freed_extents_can_be_merged(entry, new_entry)) { + entry->efd_count +=3D new_entry->efd_count; + kmem_cache_free(ext4_free_data_cachep, new_entry); + ext4_try_merge_freed_extent_next(sbi, root, entry); + return; } +insert: + rb_link_node(new_node, parent, n); + rb_insert_color(new_node, root); =20 spin_lock(&sbi->s_md_lock); list_add_tail(&new_entry->efd_list, &sbi->s_freed_data_list[new_entry->ef= d_tid & 1]); - atomic_add(clusters, &sbi->s_mb_free_pending); spin_unlock(&sbi->s_md_lock); } =20 --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga06-in.huawei.com (szxga06-in.huawei.com [45.249.212.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 26BFD20298D; Mon, 23 Jun 2025 07:47:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.32 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664826; cv=none; b=XQmajQA5wiibVPUqTu9vGauRVMbh7rmKPQ48G4Esc1IpHHEjsSu4cmj/0NnW6I/aO/czusra+s0hVPQHwyuJ17af59rUS5NzRgU1KY1Fv8fYbpzRe7WeXzZYB5q9dKjDVRWt2wU/YgDp5ghtxeg4/2jbVM+cC6LAiypnrNwxncE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664826; c=relaxed/simple; bh=hSJ7W6iMQ4xKe7LZ1zPyhGNnD5O5rE8E3Ff+PUcoFOs=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; 
b=c+mGQJUwD3S4a7QJDRXqvoBkWE9eugOCwsKt4fpnMfgRHjPASYwSuE/r4Yx1ILxjwWXOYAwbXoZO/Al0+jgAHKCPJ5NNs8PDBV1ZyhTbLm38lZh6cX25gqo5EDpBJZuZI+F/LaWepnRZjLGB2yKrn+N8UuAeJ28Bq1qqcZP4FTc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.32 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.163]) by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4bQgDF0gRhz2QVJ9; Mon, 23 Jun 2025 15:47:57 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 526CB18005F; Mon, 23 Jun 2025 15:47:02 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:01 +0800 From: Baokun Li To: CC: , , , , , , , , Subject: [PATCH v2 09/16] ext4: fix zombie groups in average fragment size lists Date: Mon, 23 Jun 2025 15:32:57 +0800 Message-ID: <20250623073304.3275702-10-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Groups with no free blocks shouldn't be in any average fragment size list. However, when all blocks in a group are allocated(i.e., bb_fragments or bb_free is 0), we currently skip updating the average fragment size, which means the group isn't removed from its previous s_mb_avg_fragment_size[old] list. This created "zombie" groups that were always skipped during traversal as they couldn't satisfy any block allocation requests, negatively impacting traversal efficiency. Therefore, when a group becomes completely free, bb_avg_fragment_size_order is now set to -1. If the old order was not -1, a removal operation is performed; if the new order is not -1, an insertion is performed. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li Reviewed-by: Jan Kara --- fs/ext4/mballoc.c | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 94950b07a577..e6d6c2da3c6e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -841,30 +841,30 @@ static void mb_update_avg_fragment_size(struct super_block *sb, struct ext4_group_info= *grp) { struct ext4_sb_info *sbi =3D EXT4_SB(sb); - int new_order; + int new, old; =20 - if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || grp->bb_fragments =3D=3D 0) + if (!test_opt2(sb, MB_OPTIMIZE_SCAN)) return; =20 - new_order =3D mb_avg_fragment_size_order(sb, - grp->bb_free / grp->bb_fragments); - if (new_order =3D=3D grp->bb_avg_fragment_size_order) + old =3D grp->bb_avg_fragment_size_order; + new =3D grp->bb_fragments =3D=3D 0 ? 
-1 : + mb_avg_fragment_size_order(sb, grp->bb_free / grp->bb_fragments); + if (new =3D=3D old) return; =20 - if (grp->bb_avg_fragment_size_order !=3D -1) { - write_lock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); + if (old >=3D 0) { + write_lock(&sbi->s_mb_avg_fragment_size_locks[old]); list_del(&grp->bb_avg_fragment_size_node); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); - } - grp->bb_avg_fragment_size_order =3D new_order; - write_lock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); - list_add_tail(&grp->bb_avg_fragment_size_node, - &sbi->s_mb_avg_fragment_size[grp->bb_avg_fragment_size_order]); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[ - grp->bb_avg_fragment_size_order]); + write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]); + } + + grp->bb_avg_fragment_size_order =3D new; + if (new >=3D 0) { + write_lock(&sbi->s_mb_avg_fragment_size_locks[new]); + list_add_tail(&grp->bb_avg_fragment_size_node, + &sbi->s_mb_avg_fragment_size[new]); + write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]); + } } =20 /* --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 42D1C2222B0; Mon, 23 Jun 2025 07:47:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664827; cv=none; b=krO9XuKEN6v+AQjc9hyJb7PL4hUMijKknaPIDLN6AzcRxW6fQf3oFuM3qwewNb7gYYD8p7hOXgRaJwIoj8l3M/uT4tfCC3SPF4htXjNgV8YmVba+w400gOQzRZEYq8TQTtVlukwqE6gsFxhBZN+Pg9B4JEF04Z0/0rQcpQO4LDA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664827; c=relaxed/simple; bh=+KSWrPtfAV8nq4GefxhPPB9oGsCXuefobotfsfhCvzw=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=bZJYtkw22gP9XMa2wOTgp7YoL6bAE3SzGFRvcsiMj73nDAg8PFOZ4t+FSigc7nIq43OjXeRELv2ZW1tYIZnyBo6lHcTBluNt4MEXHff/83B+CGwkJqzCHuZ/71KNtFUs/JJUmbMzZTn+Ua56VN0yOTDZBowlTxfh5SNGm15M+6s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.105]) by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4bQg9s5C1vztS2X; Mon, 23 Jun 2025 15:45:53 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 379CC1400DC; Mon, 23 Jun 2025 15:47:03 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:02 +0800 From: Baokun Li To: CC: , , , , , , , , Subject: [PATCH v2 10/16] ext4: fix largest free orders lists corruption on mb_optimize_scan switch Date: Mon, 23 Jun 2025 15:32:58 +0800 Message-ID: <20250623073304.3275702-11-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: 
<20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" The grp->bb_largest_free_order is updated regardless of whether mb_optimize_scan is enabled. This can lead to inconsistencies between grp->bb_largest_free_order and the actual s_mb_largest_free_orders list index when mb_optimize_scan is repeatedly enabled and disabled via remount. For example, if mb_optimize_scan is initially enabled, largest free order is 3, and the group is in s_mb_largest_free_orders[3]. Then, mb_optimize_scan is disabled via remount, block allocations occur, updating largest free order to 2. Finally, mb_optimize_scan is re-enabled via remount, more block allocations update largest free order to 1. At this point, the group would be removed from s_mb_largest_free_orders[3] under the protection of s_mb_largest_free_orders_locks[2]. This lock mismatch can lead to list corruption. To fix this, a new field bb_largest_free_order_idx is added to struct ext4_group_info to explicitly track the list index. Then still update bb_largest_free_order unconditionally, but only update bb_largest_free_order_idx when mb_optimize_scan is enabled. so that there is no inconsistency between the lock and the data to be protected. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") CC: stable@vger.kernel.org Signed-off-by: Baokun Li --- fs/ext4/ext4.h | 1 + fs/ext4/mballoc.c | 35 ++++++++++++++++------------------- 2 files changed, 17 insertions(+), 19 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 003b8d3726e8..0e574378c6a3 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3476,6 +3476,7 @@ struct ext4_group_info { int bb_avg_fragment_size_order; /* order of average fragment in BG */ ext4_grpblk_t bb_largest_free_order;/* order of largest frag in BG */ + ext4_grpblk_t bb_largest_free_order_idx; /* index of largest frag */ ext4_group_t bb_group; /* Group number */ struct list_head bb_prealloc_list; #ifdef DOUBLE_CHECK diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index e6d6c2da3c6e..dc82124f0905 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -1152,33 +1152,29 @@ static void mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *= grp) { struct ext4_sb_info *sbi =3D EXT4_SB(sb); - int i; + int new, old =3D grp->bb_largest_free_order_idx; =20 - for (i =3D MB_NUM_ORDERS(sb) - 1; i >=3D 0; i--) - if (grp->bb_counters[i] > 0) + for (new =3D MB_NUM_ORDERS(sb) - 1; new >=3D 0; new--) + if (grp->bb_counters[new] > 0) break; + + grp->bb_largest_free_order =3D new; /* No need to move between order lists? 
*/ - if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || - i =3D=3D grp->bb_largest_free_order) { - grp->bb_largest_free_order =3D i; + if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || new =3D=3D old) return; - } =20 - if (grp->bb_largest_free_order >=3D 0) { - write_lock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + if (old >=3D 0) { + write_lock(&sbi->s_mb_largest_free_orders_locks[old]); list_del_init(&grp->bb_largest_free_order_node); - write_unlock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + write_unlock(&sbi->s_mb_largest_free_orders_locks[old]); } - grp->bb_largest_free_order =3D i; - if (grp->bb_largest_free_order >=3D 0 && grp->bb_free) { - write_lock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + + grp->bb_largest_free_order_idx =3D new; + if (new >=3D 0 && grp->bb_free) { + write_lock(&sbi->s_mb_largest_free_orders_locks[new]); list_add_tail(&grp->bb_largest_free_order_node, - &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]); - write_unlock(&sbi->s_mb_largest_free_orders_locks[ - grp->bb_largest_free_order]); + &sbi->s_mb_largest_free_orders[new]); + write_unlock(&sbi->s_mb_largest_free_orders_locks[new]); } } =20 @@ -3391,6 +3387,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext= 4_group_t group, INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node); meta_group_info[i]->bb_largest_free_order =3D -1; /* uninit */ meta_group_info[i]->bb_avg_fragment_size_order =3D -1; /* uninit */ + meta_group_info[i]->bb_largest_free_order_idx =3D -1; /* uninit */ meta_group_info[i]->bb_group =3D group; =20 mb_group_bb_bitmap_alloc(sb, meta_group_info[i], group); --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga06-in.huawei.com (szxga06-in.huawei.com [45.249.212.32]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DF93C2236E5; Mon, 23 Jun 2025 07:47:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.32 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664827; cv=none; b=pSois1JbNrrlffXjSidEzSWjkruEMjyXm9UFRQITa2r18Te8qrwFNURPRlTHbV6BeEPHwyiuHhzy++60HDLXNeBLAejHL0cWxKIJh5YA3Wkrc8BSFLsy3fzTvV4se2emqzuLsZvgTlmBbN8+VvmEixuPyojc23Z5usDqRLPGBmg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664827; c=relaxed/simple; bh=UCV57UWxV7Pt589XHnQVLcQ23WTfrIhhc6eN8MIogys=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=iahChSxfMHhc/aGZQJFIcfr3rPGtKDNBqfBtev6KoXFehC/5tY6ZcRTw99ybPz0ScNPzKX0AfKaCyfl1bDARl5y4rHMzmKO7XUxbp1q3bltd4b6v7AmBNUnTH1GKaS7X7Q3+ff7MeNKSpQHRxGSfvMa1zagaa1aXK48LM9jHfhw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.32 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.44]) by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4bQgDG5yZlz2QVJv; Mon, 23 Jun 2025 15:47:58 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 143EC140276; Mon, 23 Jun 2025 15:47:04 +0800 (CST) Received: from huawei.com (10.175.112.188) by 
dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:03 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 11/16] ext4: factor out __ext4_mb_scan_group() Date: Mon, 23 Jun 2025 15:32:59 +0800 Message-ID: <20250623073304.3275702-12-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Extract __ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li --- fs/ext4/mballoc.c | 45 +++++++++++++++++++++++++++------------------ fs/ext4/mballoc.h | 2 ++ 2 files changed, 29 insertions(+), 18 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index dc82124f0905..db5d8b1e5cce 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2569,6 +2569,30 @@ void ext4_mb_scan_aligned(struct ext4_allocation_con= text *ac, } } =20 +static void __ext4_mb_scan_group(struct ext4_allocation_context *ac) +{ + bool is_stripe_aligned; + struct ext4_sb_info *sbi; + enum criteria cr =3D ac->ac_criteria; + + ac->ac_groups_scanned++; + if (cr =3D=3D CR_POWER2_ALIGNED) + return ext4_mb_simple_scan_group(ac, ac->ac_e4b); + + sbi =3D EXT4_SB(ac->ac_sb); + is_stripe_aligned =3D false; + if ((sbi->s_stripe >=3D sbi->s_cluster_ratio) && + !(ac->ac_g_ex.fe_len % EXT4_NUM_B2C(sbi, sbi->s_stripe))) + is_stripe_aligned =3D true; + + if ((cr =3D=3D CR_GOAL_LEN_FAST || cr =3D=3D CR_BEST_AVAIL_LEN) && + is_stripe_aligned) + ext4_mb_scan_aligned(ac, ac->ac_e4b); + + if (ac->ac_status =3D=3D AC_STATUS_CONTINUE) + ext4_mb_complex_scan_group(ac, ac->ac_e4b); +} + /* * This is also called BEFORE we load the buddy bitmap. 
* Returns either 1 or 0 indicating that the group is either suitable @@ -2855,6 +2879,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) */ if (ac->ac_2order) cr =3D CR_POWER2_ALIGNED; + + ac->ac_e4b =3D &e4b; repeat: for (; cr < EXT4_MB_NUM_CRS && ac->ac_status =3D=3D AC_STATUS_CONTINUE; c= r++) { ac->ac_criteria =3D cr; @@ -2932,24 +2958,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_con= text *ac) continue; } =20 - ac->ac_groups_scanned++; - if (cr =3D=3D CR_POWER2_ALIGNED) - ext4_mb_simple_scan_group(ac, &e4b); - else { - bool is_stripe_aligned =3D - (sbi->s_stripe >=3D - sbi->s_cluster_ratio) && - !(ac->ac_g_ex.fe_len % - EXT4_NUM_B2C(sbi, sbi->s_stripe)); - - if ((cr =3D=3D CR_GOAL_LEN_FAST || - cr =3D=3D CR_BEST_AVAIL_LEN) && - is_stripe_aligned) - ext4_mb_scan_aligned(ac, &e4b); - - if (ac->ac_status =3D=3D AC_STATUS_CONTINUE) - ext4_mb_complex_scan_group(ac, &e4b); - } + __ext4_mb_scan_group(ac); =20 ext4_unlock_group(sb, group); ext4_mb_unload_buddy(&e4b); diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 38c37901728d..d61d690d237c 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -213,6 +213,8 @@ struct ext4_allocation_context { __u8 ac_2order; /* if request is to allocate 2^N blocks and * N > 0, the field stores N, otherwise 0 */ __u8 ac_op; /* operation, for history only */ + + struct ext4_buddy *ac_e4b; struct folio *ac_bitmap_folio; struct folio *ac_buddy_folio; struct ext4_prealloc_space *ac_pa; --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 10BF1224B1F; Mon, 23 Jun 2025 07:47:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664829; cv=none; b=N3pGR42/jEyCMLqH+iLRTUSomviCJ7M4357iqdGTIObc0uu6mb0psus/Ym8rlsATE4A5KMkEq/bFJO8fNPTDHhBK6P2IgzNxIHlOVy2fmLUK4Yqv6AkuBG8Ivoyb59n1TuNlUF61533ZnDDwwrcYIDjuKneVV+6JcgOsIBIOs5A= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664829; c=relaxed/simple; bh=tBv+kl37r3AWknh+viFtPcAEOdRQmsnGP9h+vaIgkwI=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=HOW7+qkRonLJHFU4v6zrJvaI7lC4EROkEgmT8b0XHjgjwYrM2o7wPIDKN3aD97+O75vKLUZFI4OUid+UtwA2C7HzQVFBDaU4hF8oWbiQnIOJDb14b/LYSkTxbAySZJdAgaSN8YO4JRMk1pnT2VXwrgww+xooQZ8BnY4YrdkQxPw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.162.254]) by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4bQg9v3983ztS49; Mon, 23 Jun 2025 15:45:55 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id E5F6D180489; Mon, 23 Jun 2025 15:47:04 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:04 +0800 From: Baokun Li 
To: CC: , , , , , , , Subject: [PATCH v2 12/16] ext4: factor out ext4_mb_might_prefetch() Date: Mon, 23 Jun 2025 15:33:00 +0800 Message-ID: <20250623073304.3275702-13-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Extract ext4_mb_might_prefetch() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li --- fs/ext4/mballoc.c | 62 +++++++++++++++++++++++++++++------------------ fs/ext4/mballoc.h | 4 +++ 2 files changed, 42 insertions(+), 24 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index db5d8b1e5cce..683e7f8faab6 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2782,6 +2782,37 @@ ext4_group_t ext4_mb_prefetch(struct super_block *sb= , ext4_group_t group, return group; } =20 +/* + * Batch reads of the block allocation bitmaps to get + * multiple READs in flight; limit prefetching at inexpensive + * CR, otherwise mballoc can spend a lot of time loading + * imperfect groups + */ +static void ext4_mb_might_prefetch(struct ext4_allocation_context *ac, + ext4_group_t group) +{ + struct ext4_sb_info *sbi; + + if (ac->ac_prefetch_grp !=3D group) + return; + + sbi =3D EXT4_SB(ac->ac_sb); + if (ext4_mb_cr_expensive(ac->ac_criteria) || + ac->ac_prefetch_ios < sbi->s_mb_prefetch_limit) { + unsigned int nr =3D sbi->s_mb_prefetch; + + if (ext4_has_feature_flex_bg(ac->ac_sb)) { + nr =3D 1 << sbi->s_log_groups_per_flex; + nr -=3D group & (nr - 1); + nr =3D min(nr, sbi->s_mb_prefetch); + } + + ac->ac_prefetch_nr =3D nr; + ac->ac_prefetch_grp =3D ext4_mb_prefetch(ac->ac_sb, group, nr, + &ac->ac_prefetch_ios); + } +} + /* * Prefetching reads the block bitmap into the buffer cache; but we * need to make sure that the buddy bitmap in the page cache has been @@ -2818,10 +2849,9 @@ void ext4_mb_prefetch_fini(struct super_block *sb, e= xt4_group_t group, static noinline_for_stack int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) { - ext4_group_t prefetch_grp =3D 0, ngroups, group, i; + ext4_group_t ngroups, group, i; enum criteria new_cr, cr =3D CR_GOAL_LEN_FAST; int err =3D 0, first_err =3D 0; - unsigned int nr =3D 0, prefetch_ios =3D 0; struct ext4_sb_info *sbi; struct super_block *sb; struct ext4_buddy e4b; @@ -2881,6 +2911,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) cr =3D CR_POWER2_ALIGNED; =20 ac->ac_e4b =3D &e4b; + ac->ac_prefetch_ios =3D 0; repeat: for (; cr < EXT4_MB_NUM_CRS && ac->ac_status =3D=3D AC_STATUS_CONTINUE; c= r++) { ac->ac_criteria =3D cr; @@ -2890,8 +2921,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) */ group =3D ac->ac_g_ex.fe_group; ac->ac_groups_linear_remaining =3D sbi->s_mb_max_linear_groups; - prefetch_grp =3D group; - nr =3D 0; + ac->ac_prefetch_grp =3D group; + ac->ac_prefetch_nr =3D 0; =20 for (i =3D 0, new_cr =3D cr; i < ngroups; i++, ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) { @@ -2903,24 +2934,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_con= text *ac) goto repeat; } =20 - /* - * Batch reads of the block allocation bitmaps - * to 
get multiple READs in flight; limit - * prefetching at inexpensive CR, otherwise mballoc - * can spend a lot of time loading imperfect groups - */ - if ((prefetch_grp =3D=3D group) && - (ext4_mb_cr_expensive(cr) || - prefetch_ios < sbi->s_mb_prefetch_limit)) { - nr =3D sbi->s_mb_prefetch; - if (ext4_has_feature_flex_bg(sb)) { - nr =3D 1 << sbi->s_log_groups_per_flex; - nr -=3D group & (nr - 1); - nr =3D min(nr, sbi->s_mb_prefetch); - } - prefetch_grp =3D ext4_mb_prefetch(sb, group, - nr, &prefetch_ios); - } + ext4_mb_might_prefetch(ac, group); =20 /* prevent unnecessary buddy loading. */ if (cr < CR_ANY_FREE && @@ -3014,8 +3028,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status, ac->ac_flags, cr, err); =20 - if (nr) - ext4_mb_prefetch_fini(sb, prefetch_grp, nr); + if (ac->ac_prefetch_nr) + ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr); =20 return err; } diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index d61d690d237c..772ee0264d33 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -201,6 +201,10 @@ struct ext4_allocation_context { */ ext4_grpblk_t ac_orig_goal_len; =20 + ext4_group_t ac_prefetch_grp; + unsigned int ac_prefetch_ios; + unsigned int ac_prefetch_nr; + __u32 ac_flags; /* allocation hints */ __u32 ac_groups_linear_remaining; __u16 ac_groups_scanned; --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1A2D92264BF; Mon, 23 Jun 2025 07:47:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664830; cv=none; b=M78ktSSNq/UJrKCSheafRX8hBcMkSwbzJlGggOw1Wi0cKNMZbtbHWBJfzbyHHkifVboPn7aNvo3zOxJD5Bbl27QWRQJR9CeLQXeIr8vaWrJX2njy7PiIzsWRUo+XQO8IMisteT7/RLtPXvhWRNMLepr69zlyrUpBzbivcBBlEz8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664830; c=relaxed/simple; bh=tNupaSWSP01py596cKyGiIw4oa9w6oKuqTrHZ/XFIKY=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=mWLX8+Egb0fPztm6cKgOdrMqqRETbbqRyRanEWlv5YmEDSu8pEVokKmMTHbE7M8jxWRnt+RStppKob7+3CeKo/LAg85tQBtoGaJnG2T++K/k9kJHIkjG043zdpsj76pi/ZFkOQP54dzNBzgKQ1Gp0mVlWrSnjDBj8Id+vRgJoD8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.163.252]) by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4bQg9w2Dp0ztS4G; Mon, 23 Jun 2025 15:45:56 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id C495B180B3F; Mon, 23 Jun 2025 15:47:05 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:04 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 13/16] ext4: factor out ext4_mb_scan_group() 
Date: Mon, 23 Jun 2025 15:33:01 +0800 Message-ID: <20250623073304.3275702-14-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Extract ext4_mb_scan_group() to make the code clearer and to prepare for the later conversion of 'choose group' to 'scan groups'. No functional changes. Signed-off-by: Baokun Li --- fs/ext4/mballoc.c | 93 +++++++++++++++++++++++++---------------------- fs/ext4/mballoc.h | 2 + 2 files changed, 51 insertions(+), 44 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 683e7f8faab6..2c4c2cf3e180 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -2846,12 +2846,56 @@ void ext4_mb_prefetch_fini(struct super_block *sb, = ext4_group_t group, } } =20 +static int ext4_mb_scan_group(struct ext4_allocation_context *ac, + ext4_group_t group) +{ + int ret; + struct super_block *sb =3D ac->ac_sb; + enum criteria cr =3D ac->ac_criteria; + + ext4_mb_might_prefetch(ac, group); + + /* prevent unnecessary buddy loading. */ + if (cr < CR_ANY_FREE && spin_is_locked(ext4_group_lock_ptr(sb, group))) + return 0; + + /* This now checks without needing the buddy page */ + ret =3D ext4_mb_good_group_nolock(ac, group, cr); + if (ret <=3D 0) { + if (!ac->ac_first_err) + ac->ac_first_err =3D ret; + return 0; + } + + ret =3D ext4_mb_load_buddy(sb, group, ac->ac_e4b); + if (ret) + return ret; + + /* skip busy group */ + if (cr >=3D CR_ANY_FREE) + ext4_lock_group(sb, group); + else if (!ext4_try_lock_group(sb, group)) + goto out_unload; + + /* We need to check again after locking the block group. */ + if (unlikely(!ext4_mb_good_group(ac, group, cr))) + goto out_unlock; + + __ext4_mb_scan_group(ac); + +out_unlock: + ext4_unlock_group(sb, group); +out_unload: + ext4_mb_unload_buddy(ac->ac_e4b); + return ret; +} + static noinline_for_stack int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) { ext4_group_t ngroups, group, i; enum criteria new_cr, cr =3D CR_GOAL_LEN_FAST; - int err =3D 0, first_err =3D 0; + int err =3D 0; struct ext4_sb_info *sbi; struct super_block *sb; struct ext4_buddy e4b; @@ -2912,6 +2956,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) =20 ac->ac_e4b =3D &e4b; ac->ac_prefetch_ios =3D 0; + ac->ac_first_err =3D 0; repeat: for (; cr < EXT4_MB_NUM_CRS && ac->ac_status =3D=3D AC_STATUS_CONTINUE; c= r++) { ac->ac_criteria =3D cr; @@ -2926,7 +2971,6 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) =20 for (i =3D 0, new_cr =3D cr; i < ngroups; i++, ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) { - int ret =3D 0; =20 cond_resched(); if (new_cr !=3D cr) { @@ -2934,49 +2978,10 @@ ext4_mb_regular_allocator(struct ext4_allocation_co= ntext *ac) goto repeat; } =20 - ext4_mb_might_prefetch(ac, group); - - /* prevent unnecessary buddy loading. 
*/ - if (cr < CR_ANY_FREE && - spin_is_locked(ext4_group_lock_ptr(sb, group))) - continue; - - /* This now checks without needing the buddy page */ - ret =3D ext4_mb_good_group_nolock(ac, group, cr); - if (ret <=3D 0) { - if (!first_err) - first_err =3D ret; - continue; - } - - err =3D ext4_mb_load_buddy(sb, group, &e4b); + err =3D ext4_mb_scan_group(ac, group); if (err) goto out; =20 - /* skip busy group */ - if (cr >=3D CR_ANY_FREE) { - ext4_lock_group(sb, group); - } else if (!ext4_try_lock_group(sb, group)) { - ext4_mb_unload_buddy(&e4b); - continue; - } - - /* - * We need to check again after locking the - * block group - */ - ret =3D ext4_mb_good_group(ac, group, cr); - if (ret =3D=3D 0) { - ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); - continue; - } - - __ext4_mb_scan_group(ac); - - ext4_unlock_group(sb, group); - ext4_mb_unload_buddy(&e4b); - if (ac->ac_status !=3D AC_STATUS_CONTINUE) break; } @@ -3021,8 +3026,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) if (sbi->s_mb_stats && ac->ac_status =3D=3D AC_STATUS_FOUND) atomic64_inc(&sbi->s_bal_cX_hits[ac->ac_criteria]); out: - if (!err && ac->ac_status !=3D AC_STATUS_FOUND && first_err) - err =3D first_err; + if (!err && ac->ac_status !=3D AC_STATUS_FOUND && ac->ac_first_err) + err =3D ac->ac_first_err; =20 mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr= %d ret %d\n", ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status, diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 772ee0264d33..721aaea1f83e 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -205,6 +205,8 @@ struct ext4_allocation_context { unsigned int ac_prefetch_ios; unsigned int ac_prefetch_nr; =20 + int ac_first_err; + __u32 ac_flags; /* allocation hints */ __u32 ac_groups_linear_remaining; __u16 ac_groups_scanned; --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga01-in.huawei.com (szxga01-in.huawei.com [45.249.212.187]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E3CD7224893; Mon, 23 Jun 2025 07:47:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.187 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664831; cv=none; b=twJvIxFGOtmPOP6yHjYix2KPop6tkfEsGhB1xcBOBXEV5OiNcWG5uii2lYUO/mNG56B3kLtBzi5Hk4xmm6DMzhs413KqJY89bQjhgm4Mlsa047X6kvXD/uB97+ARzCKB6ZvoKRU2VboEKmL4gylernksvSm9CAVw2JzCDF4bNKs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664831; c=relaxed/simple; bh=UOaE8lQ9NBgo3S2R+7qmoHz0A+aFm2wPIsIk+djuQsc=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=AiwpWr5+wG3i8OD6CPqC4/C+SLgWfLXzHxyxoZV+qM6Q1ch8fpwO96EDF4Fpyzr9OOmp3X/86NxgK1gy2ZsTy25XDYGONSUcynLbrARhOBaXJxkEUYUZNdhK4/bZRCn7ny0iSDPg2Ykp0aZVl+lJFgP6vbjEBbqXtXnHvkfXf/Q= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.187 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.162.254]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4bQg8c4lZkz13MSR; Mon, 23 Jun 2025 15:44:48 +0800 (CST) Received: from dggpemf500013.china.huawei.com 
(unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id A2870180489; Mon, 23 Jun 2025 15:47:06 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:05 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 14/16] ext4: convert free group lists to ordered xarrays Date: Mon, 23 Jun 2025 15:33:02 +0800 Message-ID: <20250623073304.3275702-15-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" While traversing the list, holding a spin_lock prevents load_buddy, making direct use of ext4_try_lock_group impossible. This can lead to a bouncing scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group() fails, forcing the list traversal to repeatedly restart from grp_A. In contrast, linear traversal directly uses ext4_try_lock_group(), avoiding this bouncing. Therefore, we need a lockless, ordered traversal to achieve linear-like efficiency. Therefore, this commit converts both average fragment size lists and largest free order lists into ordered xarrays. In an xarray, the index represents the block group number and the value holds the block group information; a non-empty value indicates the block group's presence. While insertion and deletion complexity remain O(1), lookup complexity changes from O(1) to O(nlogn), which may slightly reduce single-threaded performance. After this, we can convert choose group to scan group, and then we can implement ordered optimize scan. Performance test results are as follows: Single-process operations on an empty disk show negligible impact, while multi-process workloads demonstrate a noticeable performance gain. 
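For reference, below is a condensed, illustrative sketch of the scheme
described above -- it is not the mballoc code itself, and the helper
names (demo_move_group, demo_pick_group) are hypothetical. It only
assumes the stock kernel xarray API from <linux/xarray.h> (xa_erase(),
xa_insert(), xa_for_each_range()), which is what this patch uses:

/*
 * Illustrative sketch, not part of this series. A group is keyed by its
 * block group number, so each per-order xarray stays ordered by group.
 */
static void demo_move_group(struct xarray *orders, struct ext4_group_info *grp,
			    int old_order, int new_order)
{
	/* Drop the group from its old order, then insert it at the new one. */
	if (old_order >= 0)
		xa_erase(&orders[old_order], grp->bb_group);
	if (new_order >= 0 &&
	    xa_insert(&orders[new_order], grp->bb_group, grp, GFP_ATOMIC))
		pr_warn("group %u: insert into order %d failed\n",
			grp->bb_group, new_order);
}

static struct ext4_group_info *demo_pick_group(struct xarray *xa,
					       ext4_group_t start,
					       ext4_group_t ngroups)
{
	struct ext4_group_info *grp;
	unsigned long group;

	/*
	 * Scan [start, ngroups - 1] first, then wrap around to
	 * [0, start - 1]. The real series additionally skips locked or
	 * unsuitable groups here; this sketch just takes the first hit.
	 */
	xa_for_each_range(xa, group, grp, start, ngroups - 1)
		return grp;
	if (start) {
		xa_for_each_range(xa, group, grp, 0, start - 1)
			return grp;
	}
	return NULL;
}

Because the group number is the xarray index, a scan can start at the
goal group and wrap around, mimicking the ordered linear traversal
without holding any list lock.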
CPU: Kunpeng 920 | P80 | P1 | Memory: 512GB |-------------------------|-------------------------| Disk: 960GB SSD | base | patched | base | patched | -------------------|-------|-----------------|-------|-----------------| mb_optimize_scan=3D0 | 21157 | 20976 (-0.8%) | 320645| 319396 (-0.4%) | mb_optimize_scan=3D1 | 12896 | 14580 (+13.0%) | 321233| 319237 (-0.6%) | CPU: AMD 9654 * 2 | P96 | P1 | Memory: 1536GB |-------------------------|-------------------------| Disk: 960GB SSD | base | patched | base | patched | -------------------|-------|-----------------|-------|-----------------| mb_optimize_scan=3D0 | 50420 | 51713 (+2.5%) | 206570| 206655 (0.04%) | mb_optimize_scan=3D1 | 17273 | 35527 (+105%) | 208362| 212574 (+2.0%) | Signed-off-by: Baokun Li --- fs/ext4/ext4.h | 8 +- fs/ext4/mballoc.c | 255 ++++++++++++++++++++++++---------------------- 2 files changed, 137 insertions(+), 126 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 0e574378c6a3..64e1c978a89d 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1608,10 +1608,8 @@ struct ext4_sb_info { struct list_head s_discard_list; struct work_struct s_discard_work; atomic_t s_retry_alloc_pending; - struct list_head *s_mb_avg_fragment_size; - rwlock_t *s_mb_avg_fragment_size_locks; - struct list_head *s_mb_largest_free_orders; - rwlock_t *s_mb_largest_free_orders_locks; + struct xarray *s_mb_avg_fragment_size; + struct xarray *s_mb_largest_free_orders; =20 /* tunables */ unsigned long s_stripe; @@ -3483,8 +3481,6 @@ struct ext4_group_info { void *bb_bitmap; #endif struct rw_semaphore alloc_sem; - struct list_head bb_avg_fragment_size_node; - struct list_head bb_largest_free_order_node; ext4_grpblk_t bb_counters[]; /* Nr of free power-of-two-block * regions, index is order. * bb_counters[3] =3D 5 means diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 2c4c2cf3e180..45c7717fcbbd 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -132,25 +132,30 @@ * If "mb_optimize_scan" mount option is set, we maintain in memory group = info * structures in two data structures: * - * 1) Array of largest free order lists (sbi->s_mb_largest_free_orders) + * 1) Array of largest free order xarrays (sbi->s_mb_largest_free_orders) * - * Locking: sbi->s_mb_largest_free_orders_locks(array of rw locks) + * Locking: Writers use xa_lock, readers use rcu_read_lock. * - * This is an array of lists where the index in the array represents the + * This is an array of xarrays where the index in the array represents = the * largest free order in the buddy bitmap of the participating group in= fos of - * that list. So, there are exactly MB_NUM_ORDERS(sb) (which means total - * number of buddy bitmap orders possible) number of lists. Group-infos= are - * placed in appropriate lists. + * that xarray. So, there are exactly MB_NUM_ORDERS(sb) (which means to= tal + * number of buddy bitmap orders possible) number of xarrays. Group-inf= os are + * placed in appropriate xarrays. * - * 2) Average fragment size lists (sbi->s_mb_avg_fragment_size) + * 2) Average fragment size xarrays (sbi->s_mb_avg_fragment_size) * - * Locking: sbi->s_mb_avg_fragment_size_locks(array of rw locks) + * Locking: Writers use xa_lock, readers use rcu_read_lock. * - * This is an array of lists where in the i-th list there are groups wi= th + * This is an array of xarrays where in the i-th xarray there are group= s with * average fragment size >=3D 2^i and < 2^(i+1). The average fragment s= ize * is computed as ext4_group_info->bb_free / ext4_group_info->bb_fragme= nts. 
- * Note that we don't bother with a special list for completely empty g= roups - * so we only have MB_NUM_ORDERS(sb) lists. + * Note that we don't bother with a special xarray for completely empty + * groups so we only have MB_NUM_ORDERS(sb) xarrays. Group-infos are pl= aced + * in appropriate xarrays. + * + * In xarray, the index is the block group number, the value is the block = group + * information, and a non-empty value indicates the block group is present= in + * the current xarray. * * When "mb_optimize_scan" mount option is set, mballoc consults the above= data * structures to decide the order in which groups are to be traversed for @@ -842,6 +847,7 @@ mb_update_avg_fragment_size(struct super_block *sb, str= uct ext4_group_info *grp) { struct ext4_sb_info *sbi =3D EXT4_SB(sb); int new, old; + int ret; =20 if (!test_opt2(sb, MB_OPTIMIZE_SCAN)) return; @@ -852,19 +858,71 @@ mb_update_avg_fragment_size(struct super_block *sb, s= truct ext4_group_info *grp) if (new =3D=3D old) return; =20 - if (old >=3D 0) { - write_lock(&sbi->s_mb_avg_fragment_size_locks[old]); - list_del(&grp->bb_avg_fragment_size_node); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[old]); - } + if (old >=3D 0) + xa_erase(&sbi->s_mb_avg_fragment_size[old], grp->bb_group); =20 grp->bb_avg_fragment_size_order =3D new; - if (new >=3D 0) { - write_lock(&sbi->s_mb_avg_fragment_size_locks[new]); - list_add_tail(&grp->bb_avg_fragment_size_node, - &sbi->s_mb_avg_fragment_size[new]); - write_unlock(&sbi->s_mb_avg_fragment_size_locks[new]); + if (new < 0) + return; + + ret =3D xa_insert(&sbi->s_mb_avg_fragment_size[new], + grp->bb_group, grp, GFP_ATOMIC); + if (!ret) + return; + ext4_warning(sb, "insert group: %u to s_mb_avg_fragment_size[%d] failed, = err %d", + grp->bb_group, new, ret); +} + +static struct ext4_group_info * +ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac, + struct xarray *xa, ext4_group_t start) +{ + struct super_block *sb =3D ac->ac_sb; + struct ext4_sb_info *sbi =3D EXT4_SB(sb); + enum criteria cr =3D ac->ac_criteria; + ext4_group_t ngroups =3D ext4_get_groups_count(sb); + unsigned long group =3D start; + ext4_group_t end; + struct ext4_group_info *grp; + + if (WARN_ON_ONCE(start >=3D ngroups)) + return NULL; + end =3D ngroups - 1; + +wrap_around: + xa_for_each_range(xa, group, grp, start, end) { + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); + + if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) && + likely(ext4_mb_good_group(ac, group, cr))) + return grp; + + cond_resched(); + } + + if (start) { + end =3D start - 1; + start =3D 0; + goto wrap_around; } + + return NULL; +} + +/* + * Find a suitable group of given order from the largest free orders xarra= y. 
+ */ +static struct ext4_group_info * +ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context = *ac, + int order, ext4_group_t start) +{ + struct xarray *xa =3D &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order= ]; + + if (xa_empty(xa)) + return NULL; + + return ext4_mb_find_good_group_xarray(ac, xa, start); } =20 /* @@ -875,7 +933,7 @@ static void ext4_mb_choose_next_group_p2_aligned(struct= ext4_allocation_context enum criteria *new_cr, ext4_group_t *group) { struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); - struct ext4_group_info *iter; + struct ext4_group_info *grp; int i; =20 if (ac->ac_status =3D=3D AC_STATUS_FOUND) @@ -885,26 +943,12 @@ static void ext4_mb_choose_next_group_p2_aligned(stru= ct ext4_allocation_context atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions); =20 for (i =3D ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) { - if (list_empty(&sbi->s_mb_largest_free_orders[i])) - continue; - read_lock(&sbi->s_mb_largest_free_orders_locks[i]); - if (list_empty(&sbi->s_mb_largest_free_orders[i])) { - read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); - continue; - } - list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i], - bb_largest_free_order_node) { - if (sbi->s_mb_stats) - atomic64_inc(&sbi->s_bal_cX_groups_considered[CR_POWER2_ALIGNED]); - if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && - likely(ext4_mb_good_group(ac, iter->bb_group, CR_POWER2_ALIGNED))) { - *group =3D iter->bb_group; - ac->ac_flags |=3D EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; - read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); - return; - } + grp =3D ext4_mb_find_good_group_largest_free_order(ac, i, *group); + if (grp) { + *group =3D grp->bb_group; + ac->ac_flags |=3D EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; + return; } - read_unlock(&sbi->s_mb_largest_free_orders_locks[i]); } =20 /* Increment cr and search again if no group is found */ @@ -912,35 +956,18 @@ static void ext4_mb_choose_next_group_p2_aligned(stru= ct ext4_allocation_context } =20 /* - * Find a suitable group of given order from the average fragments list. + * Find a suitable group of given order from the average fragments xarray. 
*/ static struct ext4_group_info * -ext4_mb_find_good_group_avg_frag_lists(struct ext4_allocation_context *ac,= int order) +ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac, + int order, ext4_group_t start) { - struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); - struct list_head *frag_list =3D &sbi->s_mb_avg_fragment_size[order]; - rwlock_t *frag_list_lock =3D &sbi->s_mb_avg_fragment_size_locks[order]; - struct ext4_group_info *grp =3D NULL, *iter; - enum criteria cr =3D ac->ac_criteria; + struct xarray *xa =3D &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order]; =20 - if (list_empty(frag_list)) - return NULL; - read_lock(frag_list_lock); - if (list_empty(frag_list)) { - read_unlock(frag_list_lock); + if (xa_empty(xa)) return NULL; - } - list_for_each_entry(iter, frag_list, bb_avg_fragment_size_node) { - if (sbi->s_mb_stats) - atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); - if (!spin_is_locked(ext4_group_lock_ptr(ac->ac_sb, iter->bb_group)) && - likely(ext4_mb_good_group(ac, iter->bb_group, cr))) { - grp =3D iter; - break; - } - } - read_unlock(frag_list_lock); - return grp; + + return ext4_mb_find_good_group_xarray(ac, xa, start); } =20 /* @@ -961,7 +988,7 @@ static void ext4_mb_choose_next_group_goal_fast(struct = ext4_allocation_context * =20 for (i =3D mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); i < MB_NUM_ORDERS(ac->ac_sb); i++) { - grp =3D ext4_mb_find_good_group_avg_frag_lists(ac, i); + grp =3D ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group); if (grp) { *group =3D grp->bb_group; ac->ac_flags |=3D EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED; @@ -1057,7 +1084,8 @@ static void ext4_mb_choose_next_group_best_avail(stru= ct ext4_allocation_context frag_order =3D mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); =20 - grp =3D ext4_mb_find_good_group_avg_frag_lists(ac, frag_order); + grp =3D ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order, + *group); if (grp) { *group =3D grp->bb_group; ac->ac_flags |=3D EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED; @@ -1153,6 +1181,7 @@ mb_set_largest_free_order(struct super_block *sb, str= uct ext4_group_info *grp) { struct ext4_sb_info *sbi =3D EXT4_SB(sb); int new, old =3D grp->bb_largest_free_order_idx; + int ret; =20 for (new =3D MB_NUM_ORDERS(sb) - 1; new >=3D 0; new--) if (grp->bb_counters[new] > 0) @@ -1163,19 +1192,19 @@ mb_set_largest_free_order(struct super_block *sb, s= truct ext4_group_info *grp) if (!test_opt2(sb, MB_OPTIMIZE_SCAN) || new =3D=3D old) return; =20 - if (old >=3D 0) { - write_lock(&sbi->s_mb_largest_free_orders_locks[old]); - list_del_init(&grp->bb_largest_free_order_node); - write_unlock(&sbi->s_mb_largest_free_orders_locks[old]); - } + if (old >=3D 0) + xa_erase(&sbi->s_mb_largest_free_orders[old], grp->bb_group); =20 grp->bb_largest_free_order_idx =3D new; - if (new >=3D 0 && grp->bb_free) { - write_lock(&sbi->s_mb_largest_free_orders_locks[new]); - list_add_tail(&grp->bb_largest_free_order_node, - &sbi->s_mb_largest_free_orders[new]); - write_unlock(&sbi->s_mb_largest_free_orders_locks[new]); - } + if (new < 0 || !grp->bb_free) + return; + + ret =3D xa_insert(&sbi->s_mb_largest_free_orders[new], + grp->bb_group, grp, GFP_ATOMIC); + if (!ret) + return; + ext4_warning(sb, "insert group: %u to s_mb_largest_free_orders[%d] failed= , err %d", + grp->bb_group, new, ret); } =20 static noinline_for_stack @@ -3263,6 +3292,7 @@ static int ext4_mb_seq_structs_summary_show(struct se= q_file *seq, void *v) unsigned long position =3D ((unsigned long) v); struct ext4_group_info *grp; 
unsigned int count; + unsigned long idx; =20 position--; if (position >=3D MB_NUM_ORDERS(sb)) { @@ -3271,11 +3301,8 @@ static int ext4_mb_seq_structs_summary_show(struct s= eq_file *seq, void *v) seq_puts(seq, "avg_fragment_size_lists:\n"); =20 count =3D 0; - read_lock(&sbi->s_mb_avg_fragment_size_locks[position]); - list_for_each_entry(grp, &sbi->s_mb_avg_fragment_size[position], - bb_avg_fragment_size_node) + xa_for_each(&sbi->s_mb_avg_fragment_size[position], idx, grp) count++; - read_unlock(&sbi->s_mb_avg_fragment_size_locks[position]); seq_printf(seq, "\tlist_order_%u_groups: %u\n", (unsigned int)position, count); return 0; @@ -3287,11 +3314,8 @@ static int ext4_mb_seq_structs_summary_show(struct s= eq_file *seq, void *v) seq_puts(seq, "max_free_order_lists:\n"); } count =3D 0; - read_lock(&sbi->s_mb_largest_free_orders_locks[position]); - list_for_each_entry(grp, &sbi->s_mb_largest_free_orders[position], - bb_largest_free_order_node) + xa_for_each(&sbi->s_mb_largest_free_orders[position], idx, grp) count++; - read_unlock(&sbi->s_mb_largest_free_orders_locks[position]); seq_printf(seq, "\tlist_order_%u_groups: %u\n", (unsigned int)position, count); =20 @@ -3411,8 +3435,6 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext= 4_group_t group, INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list); init_rwsem(&meta_group_info[i]->alloc_sem); meta_group_info[i]->bb_free_root =3D RB_ROOT; - INIT_LIST_HEAD(&meta_group_info[i]->bb_largest_free_order_node); - INIT_LIST_HEAD(&meta_group_info[i]->bb_avg_fragment_size_node); meta_group_info[i]->bb_largest_free_order =3D -1; /* uninit */ meta_group_info[i]->bb_avg_fragment_size_order =3D -1; /* uninit */ meta_group_info[i]->bb_largest_free_order_idx =3D -1; /* uninit */ @@ -3623,6 +3645,20 @@ static void ext4_discard_work(struct work_struct *wo= rk) ext4_mb_unload_buddy(&e4b); } =20 +static inline void ext4_mb_avg_fragment_size_destory(struct ext4_sb_info *= sbi) +{ + for (int i =3D 0; i < MB_NUM_ORDERS(sbi->s_sb); i++) + xa_destroy(&sbi->s_mb_avg_fragment_size[i]); + kfree(sbi->s_mb_avg_fragment_size); +} + +static inline void ext4_mb_largest_free_orders_destory(struct ext4_sb_info= *sbi) +{ + for (int i =3D 0; i < MB_NUM_ORDERS(sbi->s_sb); i++) + xa_destroy(&sbi->s_mb_largest_free_orders[i]); + kfree(sbi->s_mb_largest_free_orders); +} + int ext4_mb_init(struct super_block *sb) { struct ext4_sb_info *sbi =3D EXT4_SB(sb); @@ -3668,41 +3704,24 @@ int ext4_mb_init(struct super_block *sb) } while (i < MB_NUM_ORDERS(sb)); =20 sbi->s_mb_avg_fragment_size =3D - kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head), + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray), GFP_KERNEL); if (!sbi->s_mb_avg_fragment_size) { ret =3D -ENOMEM; goto out; } - sbi->s_mb_avg_fragment_size_locks =3D - kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t), - GFP_KERNEL); - if (!sbi->s_mb_avg_fragment_size_locks) { - ret =3D -ENOMEM; - goto out; - } - for (i =3D 0; i < MB_NUM_ORDERS(sb); i++) { - INIT_LIST_HEAD(&sbi->s_mb_avg_fragment_size[i]); - rwlock_init(&sbi->s_mb_avg_fragment_size_locks[i]); - } + for (i =3D 0; i < MB_NUM_ORDERS(sb); i++) + xa_init(&sbi->s_mb_avg_fragment_size[i]); + sbi->s_mb_largest_free_orders =3D - kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct list_head), + kmalloc_array(MB_NUM_ORDERS(sb), sizeof(struct xarray), GFP_KERNEL); if (!sbi->s_mb_largest_free_orders) { ret =3D -ENOMEM; goto out; } - sbi->s_mb_largest_free_orders_locks =3D - kmalloc_array(MB_NUM_ORDERS(sb), sizeof(rwlock_t), - GFP_KERNEL); - if 
(!sbi->s_mb_largest_free_orders_locks) { - ret =3D -ENOMEM; - goto out; - } - for (i =3D 0; i < MB_NUM_ORDERS(sb); i++) { - INIT_LIST_HEAD(&sbi->s_mb_largest_free_orders[i]); - rwlock_init(&sbi->s_mb_largest_free_orders_locks[i]); - } + for (i =3D 0; i < MB_NUM_ORDERS(sb); i++) + xa_init(&sbi->s_mb_largest_free_orders[i]); =20 spin_lock_init(&sbi->s_md_lock); atomic_set(&sbi->s_mb_free_pending, 0); @@ -3785,10 +3804,8 @@ int ext4_mb_init(struct super_block *sb) kvfree(sbi->s_mb_last_groups); sbi->s_mb_last_groups =3D NULL; out: - kfree(sbi->s_mb_avg_fragment_size); - kfree(sbi->s_mb_avg_fragment_size_locks); - kfree(sbi->s_mb_largest_free_orders); - kfree(sbi->s_mb_largest_free_orders_locks); + ext4_mb_avg_fragment_size_destory(sbi); + ext4_mb_largest_free_orders_destory(sbi); kfree(sbi->s_mb_offsets); sbi->s_mb_offsets =3D NULL; kfree(sbi->s_mb_maxs); @@ -3855,10 +3872,8 @@ void ext4_mb_release(struct super_block *sb) kvfree(group_info); rcu_read_unlock(); } - kfree(sbi->s_mb_avg_fragment_size); - kfree(sbi->s_mb_avg_fragment_size_locks); - kfree(sbi->s_mb_largest_free_orders); - kfree(sbi->s_mb_largest_free_orders_locks); + ext4_mb_avg_fragment_size_destory(sbi); + ext4_mb_largest_free_orders_destory(sbi); kfree(sbi->s_mb_offsets); kfree(sbi->s_mb_maxs); iput(sbi->s_buddy_cache); --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga07-in.huawei.com (szxga07-in.huawei.com [45.249.212.35]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id F2056225413; Mon, 23 Jun 2025 07:47:09 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.35 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664832; cv=none; b=DOch3hNCbDqKJTfPi7WOogUOseO9Ul9C7/8hDU7e1/D3+qP+Y4LToGplFQfxGWSpAb0gs2zr4XIlEqJkiqRAHALz/VmHW/KRWVEsE4x3rv+y/JKXmXZSKp5fztdAM0ggNwDsHcz+cUWwNpVq+A3UBesc5g1taVHsT5jqWubAPnY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664832; c=relaxed/simple; bh=hBWHrwGIgzWMQsJ34MCwJRKM5kV7QYewRLPHghxKG/0=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=A2sCsSqFVYrPrIOZzNO3J3Qr7O4JEp665yPBnL25+3GcsuuOuPSTPdgvm1+E0Fw9zBkAYtC1+jrlfvA2Xt3I8Q6lhIjnb4sgZgHbPOohnAFZVx8ldGR95YND1GLiCGnnkDbpe3xU4MWsDy+SL7m7qHN/oL+rMqkbS3HgtqNEz5k= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.35 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.234]) by szxga07-in.huawei.com (SkyGuard) with ESMTP id 4bQg9S1R6cz29dkj; Mon, 23 Jun 2025 15:45:32 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 7EABF14027A; Mon, 23 Jun 2025 15:47:07 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:06 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 15/16] ext4: refactor choose group to scan group Date: Mon, 23 Jun 2025 15:33:03 +0800 Message-ID: 
<20250623073304.3275702-16-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" This commit converts the `choose group` logic to `scan group` using previously prepared helper functions. This allows us to leverage xarrays for ordered non-linear traversal, thereby mitigating the "bouncing" issue inherent in the `choose group` mechanism. This also decouples linear and non-linear traversals, leading to cleaner and more readable code. Signed-off-by: Baokun Li --- fs/ext4/mballoc.c | 310 ++++++++++++++++++++++++---------------------- fs/ext4/mballoc.h | 1 - 2 files changed, 159 insertions(+), 152 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 45c7717fcbbd..d8372a649a0c 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -432,6 +432,10 @@ static int ext4_try_to_trim_range(struct super_block *= sb, struct ext4_buddy *e4b, ext4_grpblk_t start, ext4_grpblk_t max, ext4_grpblk_t minblocks); =20 +static int ext4_mb_scan_group(struct ext4_allocation_context *ac, + ext4_group_t group); +static void ext4_mb_might_prefetch(struct ext4_allocation_context *ac, + ext4_group_t group); /* * The algorithm using this percpu seq counter goes below: * 1. We sample the percpu discard_pa_seq counter before trying for block @@ -873,9 +877,8 @@ mb_update_avg_fragment_size(struct super_block *sb, str= uct ext4_group_info *grp) grp->bb_group, new, ret); } =20 -static struct ext4_group_info * -ext4_mb_find_good_group_xarray(struct ext4_allocation_context *ac, - struct xarray *xa, ext4_group_t start) +static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac, + struct xarray *xa, ext4_group_t start) { struct super_block *sb =3D ac->ac_sb; struct ext4_sb_info *sbi =3D EXT4_SB(sb); @@ -886,17 +889,19 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation= _context *ac, struct ext4_group_info *grp; =20 if (WARN_ON_ONCE(start >=3D ngroups)) - return NULL; + return 0; end =3D ngroups - 1; =20 wrap_around: xa_for_each_range(xa, group, grp, start, end) { + int err; + if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_groups_considered[cr]); =20 - if (!spin_is_locked(ext4_group_lock_ptr(sb, group)) && - likely(ext4_mb_good_group(ac, group, cr))) - return grp; + err =3D ext4_mb_scan_group(ac, grp->bb_group); + if (err || ac->ac_status !=3D AC_STATUS_CONTINUE) + return err; =20 cond_resched(); } @@ -907,95 +912,86 @@ ext4_mb_find_good_group_xarray(struct ext4_allocation= _context *ac, goto wrap_around; } =20 - return NULL; + return 0; } =20 /* * Find a suitable group of given order from the largest free orders xarra= y. 
*/ -static struct ext4_group_info * -ext4_mb_find_good_group_largest_free_order(struct ext4_allocation_context = *ac, - int order, ext4_group_t start) +static int +ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac, + int order, ext4_group_t start) { struct xarray *xa =3D &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order= ]; =20 if (xa_empty(xa)) - return NULL; + return 0; =20 - return ext4_mb_find_good_group_xarray(ac, xa, start); + return ext4_mb_scan_groups_xarray(ac, xa, start); } =20 /* * Choose next group by traversing largest_free_order lists. Updates *new_= cr if * cr level needs an update. */ -static void ext4_mb_choose_next_group_p2_aligned(struct ext4_allocation_co= ntext *ac, - enum criteria *new_cr, ext4_group_t *group) +static int ext4_mb_scan_groups_p2_aligned(struct ext4_allocation_context *= ac, + ext4_group_t group) { struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); - struct ext4_group_info *grp; int i; + int ret =3D 0; =20 - if (ac->ac_status =3D=3D AC_STATUS_FOUND) - return; - - if (unlikely(sbi->s_mb_stats && ac->ac_flags & EXT4_MB_CR_POWER2_ALIGNED_= OPTIMIZED)) - atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions); - + ac->ac_flags |=3D EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; for (i =3D ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) { - grp =3D ext4_mb_find_good_group_largest_free_order(ac, i, *group); - if (grp) { - *group =3D grp->bb_group; - ac->ac_flags |=3D EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; - return; - } + ret =3D ext4_mb_scan_groups_largest_free_order(ac, i, group); + if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) + goto out; } =20 + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); + /* Increment cr and search again if no group is found */ - *new_cr =3D CR_GOAL_LEN_FAST; + ac->ac_criteria =3D CR_GOAL_LEN_FAST; +out: + ac->ac_flags &=3D ~EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; + return ret; } =20 /* * Find a suitable group of given order from the average fragments xarray. */ -static struct ext4_group_info * -ext4_mb_find_good_group_avg_frag_xarray(struct ext4_allocation_context *ac, - int order, ext4_group_t start) +static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_conte= xt *ac, + int order, ext4_group_t start) { struct xarray *xa =3D &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order]; =20 if (xa_empty(xa)) - return NULL; + return 0; =20 - return ext4_mb_find_good_group_xarray(ac, xa, start); + return ext4_mb_scan_groups_xarray(ac, xa, start); } =20 /* * Choose next group by traversing average fragment size list of suitable * order. Updates *new_cr if cr level needs an update. 
*/ -static void ext4_mb_choose_next_group_goal_fast(struct ext4_allocation_con= text *ac, - enum criteria *new_cr, ext4_group_t *group) +static int ext4_mb_scan_groups_goal_fast(struct ext4_allocation_context *a= c, + ext4_group_t group) { struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); - struct ext4_group_info *grp =3D NULL; - int i; - - if (unlikely(ac->ac_flags & EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED)) { - if (sbi->s_mb_stats) - atomic_inc(&sbi->s_bal_goal_fast_bad_suggestions); - } + int i, ret =3D 0; =20 - for (i =3D mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); - i < MB_NUM_ORDERS(ac->ac_sb); i++) { - grp =3D ext4_mb_find_good_group_avg_frag_xarray(ac, i, *group); - if (grp) { - *group =3D grp->bb_group; - ac->ac_flags |=3D EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED; - return; - } + ac->ac_flags |=3D EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED; + i =3D mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); + for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) { + ret =3D ext4_mb_scan_groups_avg_frag_order(ac, i, group); + if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) + goto out; } =20 + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); /* * CR_BEST_AVAIL_LEN works based on the concept that we have * a larger normalized goal len request which can be trimmed to @@ -1005,9 +1001,12 @@ static void ext4_mb_choose_next_group_goal_fast(stru= ct ext4_allocation_context * * See function ext4_mb_normalize_request() (EXT4_MB_HINT_DATA). */ if (ac->ac_flags & EXT4_MB_HINT_DATA) - *new_cr =3D CR_BEST_AVAIL_LEN; + ac->ac_criteria =3D CR_BEST_AVAIL_LEN; else - *new_cr =3D CR_GOAL_LEN_SLOW; + ac->ac_criteria =3D CR_GOAL_LEN_SLOW; +out: + ac->ac_flags &=3D ~EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED; + return ret; } =20 /* @@ -1019,19 +1018,14 @@ static void ext4_mb_choose_next_group_goal_fast(str= uct ext4_allocation_context * * preallocations. However, we make sure that we don't trim the request too * much and fall to CR_GOAL_LEN_SLOW in that case. */ -static void ext4_mb_choose_next_group_best_avail(struct ext4_allocation_co= ntext *ac, - enum criteria *new_cr, ext4_group_t *group) +static int ext4_mb_scan_groups_best_avail(struct ext4_allocation_context *= ac, + ext4_group_t group) { + int ret =3D 0; struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); - struct ext4_group_info *grp =3D NULL; int i, order, min_order; unsigned long num_stripe_clusters =3D 0; =20 - if (unlikely(ac->ac_flags & EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED)) { - if (sbi->s_mb_stats) - atomic_inc(&sbi->s_bal_best_avail_bad_suggestions); - } - /* * mb_avg_fragment_size_order() returns order in a way that makes * retrieving back the length using (1 << order) inaccurate. 
Hence, use @@ -1062,6 +1056,7 @@ static void ext4_mb_choose_next_group_best_avail(stru= ct ext4_allocation_context if (1 << min_order < ac->ac_o_ex.fe_len) min_order =3D fls(ac->ac_o_ex.fe_len); =20 + ac->ac_flags |=3D EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED; for (i =3D order; i >=3D min_order; i--) { int frag_order; /* @@ -1084,18 +1079,19 @@ static void ext4_mb_choose_next_group_best_avail(st= ruct ext4_allocation_context frag_order =3D mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); =20 - grp =3D ext4_mb_find_good_group_avg_frag_xarray(ac, frag_order, - *group); - if (grp) { - *group =3D grp->bb_group; - ac->ac_flags |=3D EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED; - return; - } + ret =3D ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group); + if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) + goto out; } =20 /* Reset goal length to original goal length before falling into CR_GOAL_= LEN_SLOW */ ac->ac_g_ex.fe_len =3D ac->ac_orig_goal_len; - *new_cr =3D CR_GOAL_LEN_SLOW; + if (sbi->s_mb_stats) + atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); + ac->ac_criteria =3D CR_GOAL_LEN_SLOW; +out: + ac->ac_flags &=3D ~EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED; + return ret; } =20 static inline int should_optimize_scan(struct ext4_allocation_context *ac) @@ -1110,59 +1106,87 @@ static inline int should_optimize_scan(struct ext4_= allocation_context *ac) } =20 /* - * Return next linear group for allocation. + * next linear group for allocation. */ -static ext4_group_t -next_linear_group(ext4_group_t group, ext4_group_t ngroups) +static void next_linear_group(ext4_group_t *group, ext4_group_t ngroups) { /* * Artificially restricted ngroups for non-extent * files makes group > ngroups possible on first loop. */ - return group + 1 >=3D ngroups ? 0 : group + 1; + *group =3D *group + 1 >=3D ngroups ? 0 : *group + 1; +} + +static int ext4_mb_scan_groups_linear(struct ext4_allocation_context *ac, + ext4_group_t ngroups, ext4_group_t *start, ext4_group_t count) +{ + int ret, i; + enum criteria cr =3D ac->ac_criteria; + struct super_block *sb =3D ac->ac_sb; + struct ext4_sb_info *sbi =3D EXT4_SB(sb); + ext4_group_t group =3D *start; + + for (i =3D 0; i < count; i++, next_linear_group(&group, ngroups)) { + ret =3D ext4_mb_scan_group(ac, group); + if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) + return ret; + cond_resched(); + } + + *start =3D group; + if (count =3D=3D ngroups) + ac->ac_criteria++; + + /* Processed all groups and haven't found blocks */ + if (sbi->s_mb_stats && i =3D=3D ngroups) + atomic64_inc(&sbi->s_bal_cX_failed[cr]); + + return 0; } =20 /* - * ext4_mb_choose_next_group: choose next group for allocation. + * ext4_mb_scan_groups: . * * @ac Allocation Context - * @new_cr This is an output parameter. If the there is no good group - * available at current CR level, this field is updated to indi= cate - * the new cr level that should be used. - * @group This is an input / output parameter. As an input it indicate= s the - * next group that the allocator intends to use for allocation.= As - * output, this field indicates the next group that should be u= sed as - * determined by the optimization functions. 
- * @ngroups Total number of groups */ -static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac, - enum criteria *new_cr, ext4_group_t *group, ext4_group_t ngroups) +static int ext4_mb_scan_groups(struct ext4_allocation_context *ac) { - *new_cr =3D ac->ac_criteria; + int ret =3D 0; + ext4_group_t start; + struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); + ext4_group_t ngroups =3D ext4_get_groups_count(ac->ac_sb); =20 - if (!should_optimize_scan(ac)) { - *group =3D next_linear_group(*group, ngroups); - return; - } + /* non-extent files are limited to low blocks/groups */ + if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))) + ngroups =3D sbi->s_blockfile_groups; + + /* searching for the right group start from the goal value specified */ + start =3D ac->ac_g_ex.fe_group; + ac->ac_prefetch_grp =3D start; + ac->ac_prefetch_nr =3D 0; + + if (!should_optimize_scan(ac)) + return ext4_mb_scan_groups_linear(ac, ngroups, &start, ngroups); =20 /* * Optimized scanning can return non adjacent groups which can cause * seek overhead for rotational disks. So try few linear groups before * trying optimized scan. */ - if (ac->ac_groups_linear_remaining) { - *group =3D next_linear_group(*group, ngroups); - ac->ac_groups_linear_remaining--; - return; - } + if (sbi->s_mb_max_linear_groups) + ret =3D ext4_mb_scan_groups_linear(ac, ngroups, &start, + sbi->s_mb_max_linear_groups); + if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) + return ret; =20 - if (*new_cr =3D=3D CR_POWER2_ALIGNED) { - ext4_mb_choose_next_group_p2_aligned(ac, new_cr, group); - } else if (*new_cr =3D=3D CR_GOAL_LEN_FAST) { - ext4_mb_choose_next_group_goal_fast(ac, new_cr, group); - } else if (*new_cr =3D=3D CR_BEST_AVAIL_LEN) { - ext4_mb_choose_next_group_best_avail(ac, new_cr, group); - } else { + switch (ac->ac_criteria) { + case CR_POWER2_ALIGNED: + return ext4_mb_scan_groups_p2_aligned(ac, start); + case CR_GOAL_LEN_FAST: + return ext4_mb_scan_groups_goal_fast(ac, start); + case CR_BEST_AVAIL_LEN: + return ext4_mb_scan_groups_best_avail(ac, start); + default: /* * TODO: For CR_GOAL_LEN_SLOW, we can arrange groups in an * rb tree sorted by bb_free. 
But until that happens, we should @@ -1170,6 +1194,8 @@ static void ext4_mb_choose_next_group(struct ext4_all= ocation_context *ac, */ WARN_ON(1); } + + return 0; } =20 /* @@ -2875,6 +2901,18 @@ void ext4_mb_prefetch_fini(struct super_block *sb, e= xt4_group_t group, } } =20 +static inline void ac_inc_bad_suggestions(struct ext4_allocation_context *= ac) +{ + struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); + + if (ac->ac_flags & EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED) + atomic_inc(&sbi->s_bal_p2_aligned_bad_suggestions); + else if (ac->ac_flags & EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED) + atomic_inc(&sbi->s_bal_goal_fast_bad_suggestions); + else if (ac->ac_flags & EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED) + atomic_inc(&sbi->s_bal_best_avail_bad_suggestions); +} + static int ext4_mb_scan_group(struct ext4_allocation_context *ac, ext4_group_t group) { @@ -2893,7 +2931,8 @@ static int ext4_mb_scan_group(struct ext4_allocation_= context *ac, if (ret <=3D 0) { if (!ac->ac_first_err) ac->ac_first_err =3D ret; - return 0; + ret =3D 0; + goto out; } =20 ret =3D ext4_mb_load_buddy(sb, group, ac->ac_e4b); @@ -2916,26 +2955,20 @@ static int ext4_mb_scan_group(struct ext4_allocatio= n_context *ac, ext4_unlock_group(sb, group); out_unload: ext4_mb_unload_buddy(ac->ac_e4b); +out: + if (EXT4_SB(sb)->s_mb_stats && ac->ac_status =3D=3D AC_STATUS_CONTINUE) + ac_inc_bad_suggestions(ac); return ret; } =20 static noinline_for_stack int ext4_mb_regular_allocator(struct ext4_allocation_context *ac) { - ext4_group_t ngroups, group, i; - enum criteria new_cr, cr =3D CR_GOAL_LEN_FAST; + ext4_group_t i; int err =3D 0; - struct ext4_sb_info *sbi; - struct super_block *sb; + struct super_block *sb =3D ac->ac_sb; + struct ext4_sb_info *sbi =3D EXT4_SB(sb); struct ext4_buddy e4b; - int lost; - - sb =3D ac->ac_sb; - sbi =3D EXT4_SB(sb); - ngroups =3D ext4_get_groups_count(sb); - /* non-extent files are limited to low blocks/groups */ - if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))) - ngroups =3D sbi->s_blockfile_groups; =20 BUG_ON(ac->ac_status =3D=3D AC_STATUS_FOUND); =20 @@ -2980,48 +3013,21 @@ ext4_mb_regular_allocator(struct ext4_allocation_co= ntext *ac) * start with CR_GOAL_LEN_FAST, unless it is power of 2 * aligned, in which case let's do that faster approach first. 
*/ + ac->ac_criteria =3D CR_GOAL_LEN_FAST; if (ac->ac_2order) - cr =3D CR_POWER2_ALIGNED; + ac->ac_criteria =3D CR_POWER2_ALIGNED; =20 ac->ac_e4b =3D &e4b; ac->ac_prefetch_ios =3D 0; ac->ac_first_err =3D 0; repeat: - for (; cr < EXT4_MB_NUM_CRS && ac->ac_status =3D=3D AC_STATUS_CONTINUE; c= r++) { - ac->ac_criteria =3D cr; - /* - * searching for the right group start - * from the goal value specified - */ - group =3D ac->ac_g_ex.fe_group; - ac->ac_groups_linear_remaining =3D sbi->s_mb_max_linear_groups; - ac->ac_prefetch_grp =3D group; - ac->ac_prefetch_nr =3D 0; - - for (i =3D 0, new_cr =3D cr; i < ngroups; i++, - ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) { - - cond_resched(); - if (new_cr !=3D cr) { - cr =3D new_cr; - goto repeat; - } - - err =3D ext4_mb_scan_group(ac, group); - if (err) - goto out; - - if (ac->ac_status !=3D AC_STATUS_CONTINUE) - break; - } - /* Processed all groups and haven't found blocks */ - if (sbi->s_mb_stats && i =3D=3D ngroups) - atomic64_inc(&sbi->s_bal_cX_failed[cr]); + while (ac->ac_criteria < EXT4_MB_NUM_CRS) { + err =3D ext4_mb_scan_groups(ac); + if (err) + goto out; =20 - if (i =3D=3D ngroups && ac->ac_criteria =3D=3D CR_BEST_AVAIL_LEN) - /* Reset goal length to original goal length before - * falling into CR_GOAL_LEN_SLOW */ - ac->ac_g_ex.fe_len =3D ac->ac_orig_goal_len; + if (ac->ac_status !=3D AC_STATUS_CONTINUE) + break; } =20 if (ac->ac_b_ex.fe_len > 0 && ac->ac_status !=3D AC_STATUS_FOUND && @@ -3032,6 +3038,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) */ ext4_mb_try_best_found(ac, &e4b); if (ac->ac_status !=3D AC_STATUS_FOUND) { + int lost; + /* * Someone more lucky has already allocated it. * The only thing we can do is just take first @@ -3047,7 +3055,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) ac->ac_b_ex.fe_len =3D 0; ac->ac_status =3D AC_STATUS_CONTINUE; ac->ac_flags |=3D EXT4_MB_HINT_FIRST; - cr =3D CR_ANY_FREE; + ac->ac_criteria =3D CR_ANY_FREE; goto repeat; } } @@ -3060,7 +3068,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_cont= ext *ac) =20 mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr= %d ret %d\n", ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status, - ac->ac_flags, cr, err); + ac->ac_flags, ac->ac_criteria, err); =20 if (ac->ac_prefetch_nr) ext4_mb_prefetch_fini(sb, ac->ac_prefetch_grp, ac->ac_prefetch_nr); diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h index 721aaea1f83e..65713b847385 100644 --- a/fs/ext4/mballoc.h +++ b/fs/ext4/mballoc.h @@ -208,7 +208,6 @@ struct ext4_allocation_context { int ac_first_err; =20 __u32 ac_flags; /* allocation hints */ - __u32 ac_groups_linear_remaining; __u16 ac_groups_scanned; __u16 ac_found; __u16 ac_cX_found[EXT4_MB_NUM_CRS]; --=20 2.46.1 From nobody Wed Oct 8 23:44:11 2025 Received: from szxga05-in.huawei.com (szxga05-in.huawei.com [45.249.212.191]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A0386226CE1; Mon, 23 Jun 2025 07:47:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=45.249.212.191 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664832; cv=none; b=dIJPqQUH/bWgwS4Wp6lWY9dyWhuUiFFWevxEl8dffEpfV+wjhwo25iOj+au0JfImM3Eq1wukVxg0j3RZoYkNGF/VQhwBf3g8+obMomBKQmn8IICsfsd8WzIHUrMzxWU++cqRbgB4edOsm2xzAkF5+zFfGmihifADid3Up+yUD+E= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1750664832; 
c=relaxed/simple; bh=/3XoxWpy1bvkEBHAB3eTe7ji9Z2uO/Fts96126NwKyQ=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=QCJS9vmcFCLBt6oayCzKx8ty3T7JoeJuOT7mGTnx3lYKW0eEyeZkLUTtFD07E3P/svQiBOaZ+A+yq7xrgEANd+Sr9lkSCvx1i2J72M5moFmsTdtsJeRNqrAnVvx+GyR8CdrvhXhTUuexYKI2J9+Qv7unjwq5q4FeM0Z3T+P37/s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=45.249.212.191 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.19.88.163]) by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4bQg9V0zD5z2BdVj; Mon, 23 Jun 2025 15:45:34 +0800 (CST) Received: from dggpemf500013.china.huawei.com (unknown [7.185.36.188]) by mail.maildlp.com (Postfix) with ESMTPS id 4F8E818005F; Mon, 23 Jun 2025 15:47:08 +0800 (CST) Received: from huawei.com (10.175.112.188) by dggpemf500013.china.huawei.com (7.185.36.188) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Mon, 23 Jun 2025 15:47:07 +0800 From: Baokun Li To: CC: , , , , , , , Subject: [PATCH v2 16/16] ext4: ensure global ordered traversal across all free groups xarrays Date: Mon, 23 Jun 2025 15:33:04 +0800 Message-ID: <20250623073304.3275702-17-libaokun1@huawei.com> X-Mailer: git-send-email 2.46.1 In-Reply-To: <20250623073304.3275702-1-libaokun1@huawei.com> References: <20250623073304.3275702-1-libaokun1@huawei.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-ClientProxiedBy: kwepems200001.china.huawei.com (7.221.188.67) To dggpemf500013.china.huawei.com (7.185.36.188) Content-Type: text/plain; charset="utf-8" Although we now perform ordered traversal within an xarray, this is currently limited to a single xarray, traversing right then left. However, we have multiple such xarrays, which prevents us from guaranteeing a linear-like traversal where all groups on the right are visited before all groups on the left. Therefore, this change modifies the traversal to first iterate through all right groups across all xarrays, and then all left groups across all xarrays. This achieves a linear-like effect, mitigating contention between block allocation and block freeing paths. 
Performance test data follows:

CPU: Kunpeng 920   |          P80            |          P1             |
Memory: 512GB      |-------------------------|-------------------------|
Disk: 960GB SSD    |  base |     patched     |  base |     patched     |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=3D0 | 20976 | 20619 (-1.7%)   | 319396| 299238 (-6.3%)  |
mb_optimize_scan=3D1 | 14580 | 20119 (+37.9%)  | 319237| 315268 (-1.2%)  |

CPU: AMD 9654 * 2  |          P96            |          P1             |
Memory: 1536GB     |-------------------------|-------------------------|
Disk: 960GB SSD    |  base |     patched     |  base |     patched     |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=3D0 | 51713 | 51983 (+0.5%)   | 206655| 207033 (0.18%)  |
mb_optimize_scan=3D1 | 35527 | 48486 (+36.4%)  | 212574| 202415 (+4.7%)  |

Signed-off-by: Baokun Li --- fs/ext4/mballoc.c | 69 ++++++++++++++++++++++++++++++++--------------- 1 file changed, 47 insertions(+), 22 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index d8372a649a0c..d26a0e8e3f7e 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -877,22 +877,20 @@ mb_update_avg_fragment_size(struct super_block *sb, s= truct ext4_group_info *grp) grp->bb_group, new, ret); } =20 -static int ext4_mb_scan_groups_xarray(struct ext4_allocation_context *ac, - struct xarray *xa, ext4_group_t start) +static int ext4_mb_scan_groups_xa_range(struct ext4_allocation_context *ac, + struct xarray *xa, + ext4_group_t start, ext4_group_t end) { struct super_block *sb =3D ac->ac_sb; struct ext4_sb_info *sbi =3D EXT4_SB(sb); enum criteria cr =3D ac->ac_criteria; ext4_group_t ngroups =3D ext4_get_groups_count(sb); unsigned long group =3D start; - ext4_group_t end; struct ext4_group_info *grp; =20 - if (WARN_ON_ONCE(start >=3D ngroups)) + if (WARN_ON_ONCE(end >=3D ngroups || start > end)) return 0; - end =3D ngroups - 1; =20 -wrap_around: xa_for_each_range(xa, group, grp, start, end) { int err; =20 @@ -906,28 +904,23 @@ static int ext4_mb_scan_groups_xarray(struct ext4_all= ocation_context *ac, cond_resched(); } =20 - if (start) { - end =3D start - 1; - start =3D 0; - goto wrap_around; - } - return 0; } =20 /* * Find a suitable group of given order from the largest free orders xarra= y.
*/ -static int -ext4_mb_scan_groups_largest_free_order(struct ext4_allocation_context *ac, - int order, ext4_group_t start) +static inline int +ext4_mb_scan_groups_largest_free_order_range(struct ext4_allocation_contex= t *ac, + int order, ext4_group_t start, + ext4_group_t end) { struct xarray *xa =3D &EXT4_SB(ac->ac_sb)->s_mb_largest_free_orders[order= ]; =20 if (xa_empty(xa)) return 0; =20 - return ext4_mb_scan_groups_xarray(ac, xa, start); + return ext4_mb_scan_groups_xa_range(ac, xa, start, end - 1); } =20 /* @@ -940,13 +933,23 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4= _allocation_context *ac, struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); int i; int ret =3D 0; + ext4_group_t start, end; =20 ac->ac_flags |=3D EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED; + start =3D group; + end =3D ext4_get_groups_count(ac->ac_sb); +wrap_around: for (i =3D ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) { - ret =3D ext4_mb_scan_groups_largest_free_order(ac, i, group); + ret =3D ext4_mb_scan_groups_largest_free_order_range(ac, i, + start, end); if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) goto out; } + if (start) { + end =3D start; + start =3D 0; + goto wrap_around; + } =20 if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); @@ -961,15 +964,17 @@ static int ext4_mb_scan_groups_p2_aligned(struct ext4= _allocation_context *ac, /* * Find a suitable group of given order from the average fragments xarray. */ -static int ext4_mb_scan_groups_avg_frag_order(struct ext4_allocation_conte= xt *ac, - int order, ext4_group_t start) +static int +ext4_mb_scan_groups_avg_frag_order_range(struct ext4_allocation_context *a= c, + int order, ext4_group_t start, + ext4_group_t end) { struct xarray *xa =3D &EXT4_SB(ac->ac_sb)->s_mb_avg_fragment_size[order]; =20 if (xa_empty(xa)) return 0; =20 - return ext4_mb_scan_groups_xarray(ac, xa, start); + return ext4_mb_scan_groups_xa_range(ac, xa, start, end - 1); } =20 /* @@ -981,14 +986,24 @@ static int ext4_mb_scan_groups_goal_fast(struct ext4_= allocation_context *ac, { struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); int i, ret =3D 0; + ext4_group_t start, end; =20 ac->ac_flags |=3D EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED; + start =3D group; + end =3D ext4_get_groups_count(ac->ac_sb); +wrap_around: i =3D mb_avg_fragment_size_order(ac->ac_sb, ac->ac_g_ex.fe_len); for (; i < MB_NUM_ORDERS(ac->ac_sb); i++) { - ret =3D ext4_mb_scan_groups_avg_frag_order(ac, i, group); + ret =3D ext4_mb_scan_groups_avg_frag_order_range(ac, i, + start, end); if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) goto out; } + if (start) { + end =3D start; + start =3D 0; + goto wrap_around; + } =20 if (sbi->s_mb_stats) atomic64_inc(&sbi->s_bal_cX_failed[ac->ac_criteria]); @@ -1025,6 +1040,7 @@ static int ext4_mb_scan_groups_best_avail(struct ext4= _allocation_context *ac, struct ext4_sb_info *sbi =3D EXT4_SB(ac->ac_sb); int i, order, min_order; unsigned long num_stripe_clusters =3D 0; + ext4_group_t start, end; =20 /* * mb_avg_fragment_size_order() returns order in a way that makes @@ -1057,6 +1073,9 @@ static int ext4_mb_scan_groups_best_avail(struct ext4= _allocation_context *ac, min_order =3D fls(ac->ac_o_ex.fe_len); =20 ac->ac_flags |=3D EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED; + start =3D group; + end =3D ext4_get_groups_count(ac->ac_sb); +wrap_around: for (i =3D order; i >=3D min_order; i--) { int frag_order; /* @@ -1079,10 +1098,16 @@ static int ext4_mb_scan_groups_best_avail(struct ex= t4_allocation_context *ac, frag_order =3D mb_avg_fragment_size_order(ac->ac_sb, 
ac->ac_g_ex.fe_len); =20 - ret =3D ext4_mb_scan_groups_avg_frag_order(ac, frag_order, group); + ret =3D ext4_mb_scan_groups_avg_frag_order_range(ac, frag_order, + start, end); if (ret || ac->ac_status !=3D AC_STATUS_CONTINUE) goto out; } + if (start) { + end =3D start; + start =3D 0; + goto wrap_around; + } =20 /* Reset goal length to original goal length before falling into CR_GOAL_= LEN_SLOW */ ac->ac_g_ex.fe_len =3D ac->ac_orig_goal_len; --=20 2.46.1
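
The traversal order established by the patch above can be pictured with a minimal,
self-contained userspace sketch. This is only a model: the per-order boolean lists
stand in for the kernel's per-order xarrays, and the names in_list, try_group,
scan_order_range and scan_all_orders are invented for the example. What it shows is
the two-pass order: every list is scanned over [goal, ngroups) first, and only then
over [0, goal), which is the "all right groups before all left groups" behaviour the
commit message describes.

#include <stdbool.h>
#include <stdio.h>

#define NUM_ORDERS  4
#define NUM_GROUPS  16

/* toy membership table: in_list[order][group] says whether the group is
 * currently filed under that order (the kernel uses per-order xarrays) */
static bool in_list[NUM_ORDERS][NUM_GROUPS];

/* pretend allocator hook: return true when a group satisfies the request */
static bool try_group(unsigned int group)
{
	printf("visiting group %u\n", group);
	return false;            /* keep scanning so the full order is visible */
}

/* scan one order's list, but only groups in [start, end) */
static bool scan_order_range(int order, unsigned int start, unsigned int end)
{
	for (unsigned int g = start; g < end; g++)
		if (in_list[order][g] && try_group(g))
			return true;
	return false;
}

/*
 * Equivalent of the wrap_around logic added to the callers: the right half
 * of every order list first, then the left half of every order list.
 */
static bool scan_all_orders(unsigned int goal)
{
	unsigned int start = goal, end = NUM_GROUPS;

wrap_around:
	for (int order = 0; order < NUM_ORDERS; order++)
		if (scan_order_range(order, start, end))
			return true;
	if (start) {              /* second pass: groups before the goal */
		end = start;
		start = 0;
		goto wrap_around;
	}
	return false;
}

int main(void)
{
	/* file a few groups under arbitrary orders */
	in_list[1][3] = in_list[1][12] = true;
	in_list[2][5] = in_list[3][14] = true;

	scan_all_orders(10);      /* groups >= 10 print before groups < 10 */
	return 0;
}

With a goal of 10, the sketch visits groups 12 and 14 before 3 and 5: all right-side
groups across every list are tried first, giving the linear-like effect that keeps
allocation and freeing paths from repeatedly colliding on the same low-numbered groups.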
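
The reworked top-level loop from earlier in this series can be modelled the same way:
the allocation criteria now lives in the allocation context and is advanced by the
per-criteria scan helpers themselves, so ext4_mb_regular_allocator() simply keeps
calling the scan until the status changes or the criteria run out. The sketch below is
a loose, userspace-only model under that assumption; the enum values echo the kernel's
CR_* names, but the types and the scan_groups helper are hypothetical.

#include <stdbool.h>
#include <stdio.h>

enum criteria {
	CR_POWER2_ALIGNED,
	CR_GOAL_LEN_FAST,
	CR_BEST_AVAIL_LEN,
	CR_GOAL_LEN_SLOW,
	CR_ANY_FREE,
	NUM_CRS,
};

struct alloc_ctx {
	enum criteria criteria;
	bool found;
};

/* stand-in for the per-criteria scan: try the current criteria and, when
 * nothing is found, bump the criteria before returning */
static int scan_groups(struct alloc_ctx *ac)
{
	printf("scanning with criteria %d\n", ac->criteria);

	if (ac->criteria == CR_GOAL_LEN_SLOW)
		ac->found = true;        /* pretend the slow scan succeeds */
	else
		ac->criteria++;          /* escalate, as the helpers now do */
	return 0;
}

int main(void)
{
	struct alloc_ctx ac = { .criteria = CR_GOAL_LEN_FAST };

	while (ac.criteria < NUM_CRS) {
		if (scan_groups(&ac))
			break;               /* a real error would abort the loop */
		if (ac.found)
			break;
	}
	printf("found=%d at criteria %d\n", ac.found, ac.criteria);
	return 0;
}

The escalation from CR_GOAL_LEN_FAST through CR_BEST_AVAIL_LEN to CR_GOAL_LEN_SLOW
falls out of the helpers bumping the criteria on failure, rather than the caller
recomputing a new criteria on every group as the old choose_next_group loop did.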