[v3] ext4: better scalability for ext4 block allocation

[PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group

Posted by Baokun Li 2 months, 3 weeks ago

After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.

Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.

Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  2 +-
 fs/ext4/mballoc.c | 12 +++---------
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b83095541c98..7f5c070de0fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1630,7 +1630,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
-	unsigned long s_mb_last_group;
+	ext4_group_t s_mb_last_group;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e3a5103e1620..025b759ca643 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		spin_lock(&sbi->s_md_lock);
-		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
-		spin_unlock(&sbi->s_md_lock);
-	}
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
+		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	/* if stream allocation is enabled, use global goal */
 	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
-		/* TBD: may be hot point */
-		spin_lock(&sbi->s_md_lock);
-		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
-		spin_unlock(&sbi->s_md_lock);
+		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
 		ac->ac_g_ex.fe_start = -1;
 		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
 	}
-- 
2.46.1

Re: [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group

Posted by Ojaswin Mujoo 2 months, 3 weeks ago

On Mon, Jul 14, 2025 at 09:03:14PM +0800, Baokun Li wrote:
> After we optimized the block group lock, we found another lock
> contention issue when running will-it-scale/fallocate2 with multiple
> processes. The fallocate's block allocation and the truncate's block
> release were fighting over the s_md_lock. The problem is, this lock
> protects totally different things in those two processes: the list of
> freed data blocks (s_freed_data_list) when releasing, and where to start
> looking for new blocks (mb_last_group) when allocating.
> 
> Now we only need to track s_mb_last_group and no longer need to track
> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
> two are consistent. Since s_mb_last_group is merely a hint and doesn't
> require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.

Hi Baokun,

So i just got curious of the difference between smp_load_acquire vs
READ_ONCE on PowerPC, another weak memory ordering arch.
Interestingly, I didn't see that big of a single threaded drop.

The number are as follows (mb_opt_scan=1):

100 threads 
w/ smp_load_acquire    1668 MB/s 
w/ READ_ONCE           1599 MB/s

1 thread pinned to 1 cpu
w/ smp_load_acquire    292 MB/s
w/ READ_ONCE           296 MB/s

Either ways, this is much better than the base which is around 500MB/s
but just thought I'd share it here

Feel free to add:
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Regards,
ojaswin
> 
> Besides, the s_mb_last_group data type only requires ext4_group_t
> (i.e., unsigned int), rendering unsigned long superfluous.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
> |CPU: Kunpeng 920   |          P80           |            P1           |
> |Memory: 512GB      |------------------------|-------------------------|
> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
> |mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |
> 
> |CPU: AMD 9654 * 2  |          P96           |             P1          |
> |Memory: 1536GB     |------------------------|-------------------------|
> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
> |-------------------|-------|----------------|--------|----------------|
> |mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
> |mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |
> 
> Suggested-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ---
>  fs/ext4/ext4.h    |  2 +-
>  fs/ext4/mballoc.c | 12 +++---------
>  2 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index b83095541c98..7f5c070de0fb 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1630,7 +1630,7 @@ struct ext4_sb_info {
>  	unsigned int s_mb_group_prealloc;
>  	unsigned int s_max_dir_size_kb;
>  	/* where last allocation was done - for stream allocation */
> -	unsigned long s_mb_last_group;
> +	ext4_group_t s_mb_last_group;
>  	unsigned int s_mb_prefetch;
>  	unsigned int s_mb_prefetch_limit;
>  	unsigned int s_mb_best_avail_max_trim_order;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index e3a5103e1620..025b759ca643 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>  	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>  	folio_get(ac->ac_buddy_folio);
>  	/* store last allocated for subsequent stream allocation */
> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		spin_lock(&sbi->s_md_lock);
> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
> -		spin_unlock(&sbi->s_md_lock);
> -	}
> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
> +		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>  	/*
>  	 * As we've just preallocated more space than
>  	 * user requested originally, we store allocated
> @@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>  
>  	/* if stream allocation is enabled, use global goal */
>  	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
> -		/* TBD: may be hot point */
> -		spin_lock(&sbi->s_md_lock);
> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
> -		spin_unlock(&sbi->s_md_lock);
> +		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
>  		ac->ac_g_ex.fe_start = -1;
>  		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>  	}
> -- 
> 2.46.1
>

Re: [PATCH v3 04/17] ext4: remove unnecessary s_md_lock on update s_mb_last_group

Posted by Baokun Li 2 months, 2 weeks ago

On 2025/7/17 21:36, Ojaswin Mujoo wrote:
> On Mon, Jul 14, 2025 at 09:03:14PM +0800, Baokun Li wrote:
>> After we optimized the block group lock, we found another lock
>> contention issue when running will-it-scale/fallocate2 with multiple
>> processes. The fallocate's block allocation and the truncate's block
>> release were fighting over the s_md_lock. The problem is, this lock
>> protects totally different things in those two processes: the list of
>> freed data blocks (s_freed_data_list) when releasing, and where to start
>> looking for new blocks (mb_last_group) when allocating.
>>
>> Now we only need to track s_mb_last_group and no longer need to track
>> s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
>> two are consistent. Since s_mb_last_group is merely a hint and doesn't
>> require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.
> Hi Baokun,
>
> So i just got curious of the difference between smp_load_acquire vs
> READ_ONCE on PowerPC, another weak memory ordering arch.
> Interestingly, I didn't see that big of a single threaded drop.
>
> The number are as follows (mb_opt_scan=1):
>
> 100 threads
> w/ smp_load_acquire    1668 MB/s
> w/ READ_ONCE           1599 MB/s
>
> 1 thread pinned to 1 cpu
> w/ smp_load_acquire    292 MB/s
> w/ READ_ONCE           296 MB/s
>
> Either ways, this is much better than the base which is around 500MB/s
> but just thought I'd share it here

Thank you for providing the test data for PowerPC, it is true that
the results may vary slightly between architectures.

>
> Feel free to add:
> Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
>
Thank you for you review!

Cheers,
Baokun

>> Besides, the s_mb_last_group data type only requires ext4_group_t
>> (i.e., unsigned int), rendering unsigned long superfluous.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>> |CPU: Kunpeng 920   |          P80           |            P1           |
>> |Memory: 512GB      |------------------------|-------------------------|
>> |960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
>> |-------------------|-------|----------------|--------|----------------|
>> |mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
>> |mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |
>>
>> |CPU: AMD 9654 * 2  |          P96           |             P1          |
>> |Memory: 1536GB     |------------------------|-------------------------|
>> |960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
>> |-------------------|-------|----------------|--------|----------------|
>> |mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
>> |mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |
>>
>> Suggested-by: Jan Kara <jack@suse.cz>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>> ---
>>   fs/ext4/ext4.h    |  2 +-
>>   fs/ext4/mballoc.c | 12 +++---------
>>   2 files changed, 4 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index b83095541c98..7f5c070de0fb 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1630,7 +1630,7 @@ struct ext4_sb_info {
>>   	unsigned int s_mb_group_prealloc;
>>   	unsigned int s_max_dir_size_kb;
>>   	/* where last allocation was done - for stream allocation */
>> -	unsigned long s_mb_last_group;
>> +	ext4_group_t s_mb_last_group;
>>   	unsigned int s_mb_prefetch;
>>   	unsigned int s_mb_prefetch_limit;
>>   	unsigned int s_mb_best_avail_max_trim_order;
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index e3a5103e1620..025b759ca643 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -2168,11 +2168,8 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
>>   	ac->ac_buddy_folio = e4b->bd_buddy_folio;
>>   	folio_get(ac->ac_buddy_folio);
>>   	/* store last allocated for subsequent stream allocation */
>> -	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		spin_lock(&sbi->s_md_lock);
>> -		sbi->s_mb_last_group = ac->ac_f_ex.fe_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> -	}
>> +	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
>> +		WRITE_ONCE(sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
>>   	/*
>>   	 * As we've just preallocated more space than
>>   	 * user requested originally, we store allocated
>> @@ -2845,10 +2842,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>>   
>>   	/* if stream allocation is enabled, use global goal */
>>   	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
>> -		/* TBD: may be hot point */
>> -		spin_lock(&sbi->s_md_lock);
>> -		ac->ac_g_ex.fe_group = sbi->s_mb_last_group;
>> -		spin_unlock(&sbi->s_md_lock);
>> +		ac->ac_g_ex.fe_group = READ_ONCE(sbi->s_mb_last_group);
>>   		ac->ac_g_ex.fe_start = -1;
>>   		ac->ac_flags &= ~EXT4_MB_HINT_TRY_GOAL;
>>   	}
>> -- 
>> 2.46.1
>>