[PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention

Posted by Baokun Li 3 months, 2 weeks ago
When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.

However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.

To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the CPU count rounded up to the
nearest power of 2. To ensure a consistent goal for each inode, we select
the corresponding goal by taking the inode number modulo the total number
of goals.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

                   | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
 Disk: 960GB SSD   |-------------------------|-------------------------|
                   | base  |    patched      | base  |    patched      |
-------------------|-------|-----------------|-------|-----------------|
mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/ext4.h    |  2 +-
 fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++-------
 fs/ext4/mballoc.h |  9 +++++++++
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 93f03d8c3dca..c3f16aba7b79 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1630,7 +1630,7 @@ struct ext4_sb_info {
 	unsigned int s_mb_group_prealloc;
 	unsigned int s_max_dir_size_kb;
 	/* where last allocation was done - for stream allocation */
-	ext4_group_t s_mb_last_group;
+	ext4_group_t *s_mb_last_groups;
 	unsigned int s_mb_prefetch;
 	unsigned int s_mb_prefetch_limit;
 	unsigned int s_mb_best_avail_max_trim_order;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 3f103919868b..216b332a5054 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2168,9 +2168,12 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	ac->ac_buddy_folio = e4b->bd_buddy_folio;
 	folio_get(ac->ac_buddy_folio);
 	/* store last allocated for subsequent stream allocation */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
-		/* pairs with smp_load_acquire in ext4_mb_regular_allocator() */
-		smp_store_release(&sbi->s_mb_last_group, ac->ac_f_ex.fe_group);
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
+		int hash = ac->ac_inode->i_ino % MB_LAST_GROUPS;
+		/* Pairs with smp_load_acquire in ext4_mb_regular_allocator() */
+		smp_store_release(&sbi->s_mb_last_groups[hash],
+				  ac->ac_f_ex.fe_group);
+	}
 	/*
 	 * As we've just preallocated more space than
 	 * user requested originally, we store allocated
@@ -2842,9 +2845,12 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	}
 
 	/* if stream allocation is enabled, use global goal */
-	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC)
-		/* pairs with smp_store_release in ext4_mb_use_best_found() */
-		ac->ac_g_ex.fe_group = smp_load_acquire(&sbi->s_mb_last_group);
+	if (ac->ac_flags & EXT4_MB_STREAM_ALLOC) {
+		int hash = ac->ac_inode->i_ino % MB_LAST_GROUPS;
+		/* Pairs with smp_store_release in ext4_mb_use_best_found() */
+		ac->ac_g_ex.fe_group = smp_load_acquire(
+						&sbi->s_mb_last_groups[hash]);
+	}
 
 	/*
 	 * Let's just scan groups to find more-less suitable blocks We
@@ -3715,10 +3721,17 @@ int ext4_mb_init(struct super_block *sb)
 			sbi->s_mb_group_prealloc, EXT4_NUM_B2C(sbi, sbi->s_stripe));
 	}
 
+	sbi->s_mb_last_groups = kcalloc(MB_LAST_GROUPS, sizeof(ext4_group_t),
+					GFP_KERNEL);
+	if (sbi->s_mb_last_groups == NULL) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
 	sbi->s_locality_groups = alloc_percpu(struct ext4_locality_group);
 	if (sbi->s_locality_groups == NULL) {
 		ret = -ENOMEM;
-		goto out;
+		goto out_free_last_groups;
 	}
 	for_each_possible_cpu(i) {
 		struct ext4_locality_group *lg;
@@ -3743,6 +3756,9 @@ int ext4_mb_init(struct super_block *sb)
 out_free_locality_groups:
 	free_percpu(sbi->s_locality_groups);
 	sbi->s_locality_groups = NULL;
+out_free_last_groups:
+	kvfree(sbi->s_mb_last_groups);
+	sbi->s_mb_last_groups = NULL;
 out:
 	kfree(sbi->s_mb_avg_fragment_size);
 	kfree(sbi->s_mb_avg_fragment_size_locks);
@@ -3847,6 +3863,7 @@ void ext4_mb_release(struct super_block *sb)
 	}
 
 	free_percpu(sbi->s_locality_groups);
+	kvfree(sbi->s_mb_last_groups);
 }
 
 static inline int ext4_issue_discard(struct super_block *sb,
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index f8280de3e882..38c37901728d 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -97,6 +97,15 @@
  */
 #define MB_NUM_ORDERS(sb)		((sb)->s_blocksize_bits + 2)
 
+/*
+ * Number of mb last groups
+ */
+#ifdef CONFIG_SMP
+#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
+#else
+#define MB_LAST_GROUPS 1
+#endif
+
 struct ext4_free_data {
 	/* this links the free block information from sb_info */
 	struct list_head		efd_list;
-- 
2.46.1

Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Jan Kara 3 months, 1 week ago
On Mon 23-06-25 15:32:52, Baokun Li wrote:
> When allocating data blocks, if the first try (goal allocation) fails and
> stream allocation is on, it tries a global goal starting from the last
> group we used (s_mb_last_group). This helps cluster large files together
> to reduce free space fragmentation, and the data block contiguity also
> accelerates write-back to disk.
> 
> However, when multiple processes allocate blocks, having just one global
> goal means they all fight over the same group. This drastically lowers
> the chances of extents merging and leads to much worse file fragmentation.
> 
> To mitigate this multi-process contention, we now employ multiple global
> goals, with the number of goals being the CPU count rounded up to the
> nearest power of 2. To ensure a consistent goal for each inode, we select
> the corresponding goal by taking the inode number modulo the total number
> of goals.
> 
> Performance test data follows:
> 
> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> Observation: Average fallocate operations per container per second.
> 
>                    | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>  Disk: 960GB SSD   |-------------------------|-------------------------|
>                    | base  |    patched      | base  |    patched      |
> -------------------|-------|-----------------|-------|-----------------|
> mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
> mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
> 
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

...

> +/*
> + * Number of mb last groups
> + */
> +#ifdef CONFIG_SMP
> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
> +#else
> +#define MB_LAST_GROUPS 1
> +#endif
> +

I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
distribution kernels (it is just a theoretical maximum for the number of
CPUs the kernel can support) which seems like far too much for small
filesystems with say 100 block groups. I'd rather pick the array size like:

min(num_possible_cpus(), sbi->s_groups_count/4)

to

a) not have too many slots, so we still concentrate big allocations in a
somewhat limited area of the filesystem (a quarter of the block groups here).

b) have at most one slot per CPU the machine hardware can in principle
support.
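
Concretely, I mean something along these lines in ext4_mb_init() (a
completely untested sketch against this patch; s_mb_nr_global_goals is only
an illustrative name for wherever the chosen slot count would be stored,
and the max_t() floor is just there so a tiny filesystem doesn't round down
to zero slots):

        /*
         * Sketch: bound the number of stream-allocation goals by both the
         * number of possible CPUs and a quarter of the block groups, with
         * a floor of one slot for very small filesystems.
         */
        sbi->s_mb_nr_global_goals = min_t(unsigned int, num_possible_cpus(),
                                          max_t(unsigned int, 1,
                                                sbi->s_groups_count / 4));
        sbi->s_mb_last_groups = kcalloc(sbi->s_mb_nr_global_goals,
                                        sizeof(ext4_group_t), GFP_KERNEL);
        if (sbi->s_mb_last_groups == NULL) {
                ret = -ENOMEM;
                goto out;
        }

The hash in ext4_mb_use_best_found() / ext4_mb_regular_allocator() would
then be i_ino % sbi->s_mb_nr_global_goals instead of the compile-time
MB_LAST_GROUPS.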

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Baokun Li 3 months, 1 week ago
On 2025/6/28 2:31, Jan Kara wrote:
> On Mon 23-06-25 15:32:52, Baokun Li wrote:
>> When allocating data blocks, if the first try (goal allocation) fails and
>> stream allocation is on, it tries a global goal starting from the last
>> group we used (s_mb_last_group). This helps cluster large files together
>> to reduce free space fragmentation, and the data block contiguity also
>> accelerates write-back to disk.
>>
>> However, when multiple processes allocate blocks, having just one global
>> goal means they all fight over the same group. This drastically lowers
>> the chances of extents merging and leads to much worse file fragmentation.
>>
>> To mitigate this multi-process contention, we now employ multiple global
>> goals, with the number of goals being the CPU count rounded up to the
>> nearest power of 2. To ensure a consistent goal for each inode, we select
>> the corresponding goal by taking the inode number modulo the total number
>> of goals.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>   Disk: 960GB SSD   |-------------------------|-------------------------|
>>                     | base  |    patched      | base  |    patched      |
>> -------------------|-------|-----------------|-------|-----------------|
>> mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
>> mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
>>
>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
> ...
>
>> +/*
>> + * Number of mb last groups
>> + */
>> +#ifdef CONFIG_SMP
>> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
>> +#else
>> +#define MB_LAST_GROUPS 1
>> +#endif
>> +
> I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
> distribution kernels (it is just a theoretical maximum for the number of
> CPUs the kernel can support)

nr_cpu_ids is generally equal to num_possible_cpus(). Only when
CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
which represents the maximum number of supported CPUs.

> which seems like far too much for small
> filesystems with say 100 block groups.

It does make sense.

> I'd rather pick the array size like:
>
> min(num_possible_cpus(), sbi->s_groups_count/4)
>
> to
>
> a) don't have too many slots so we still concentrate big allocations in
> somewhat limited area of the filesystem (a quarter of block groups here).
>
> b) have at most one slot per CPU the machine hardware can in principle
> support.
>
> 								Honza

You're right, we should consider the number of block groups when setting
the number of global goals.

However, a server's rootfs can often be quite small, perhaps only tens of
GBs, while having many CPUs. In such cases, sbi->s_groups_count / 4 might
still limit the filesystem's scalability. Furthermore, after supporting
LBS, the number of block groups will sharply decrease.

How about we directly use sbi->s_groups_count (which would effectively be
min(num_possible_cpus(), sbi->s_groups_count)) instead? This would also
avoid zero values.


Cheers,
Baokun

Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Jan Kara 3 months, 1 week ago
On Mon 30-06-25 14:50:30, Baokun Li wrote:
> On 2025/6/28 2:31, Jan Kara wrote:
> > On Mon 23-06-25 15:32:52, Baokun Li wrote:
> > > When allocating data blocks, if the first try (goal allocation) fails and
> > > stream allocation is on, it tries a global goal starting from the last
> > > group we used (s_mb_last_group). This helps cluster large files together
> > > to reduce free space fragmentation, and the data block contiguity also
> > > accelerates write-back to disk.
> > > 
> > > However, when multiple processes allocate blocks, having just one global
> > > goal means they all fight over the same group. This drastically lowers
> > > the chances of extents merging and leads to much worse file fragmentation.
> > > 
> > > To mitigate this multi-process contention, we now employ multiple global
> > > goals, with the number of goals being the CPU count rounded up to the
> > > nearest power of 2. To ensure a consistent goal for each inode, we select
> > > the corresponding goal by taking the inode number modulo the total number
> > > of goals.
> > > 
> > > Performance test data follows:
> > > 
> > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
> > >   Disk: 960GB SSD   |-------------------------|-------------------------|
> > >                     | base  |    patched      | base  |    patched      |
> > > -------------------|-------|-----------------|-------|-----------------|
> > > mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
> > > mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@huawei.com>
> > ...
> > 
> > > +/*
> > > + * Number of mb last groups
> > > + */
> > > +#ifdef CONFIG_SMP
> > > +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
> > > +#else
> > > +#define MB_LAST_GROUPS 1
> > > +#endif
> > > +
> > I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
> > distribution kernels (it is just a theoretical maximum for the number of
> > CPUs the kernel can support)
> 
> nr_cpu_ids is generally equal to num_possible_cpus(). Only when
> CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
> which represents the maximum number of supported CPUs.

Indeed, CONFIG_FORCE_NR_CPUS confused me.

> > which seems like far too much for small
> > filesystems with say 100 block groups.
> 
> It does make sense.
> 
> > I'd rather pick the array size like:
> > 
> > min(num_possible_cpus(), sbi->s_groups_count/4)
> > 
> > to
> > 
> > a) don't have too many slots so we still concentrate big allocations in
> > somewhat limited area of the filesystem (a quarter of block groups here).
> > 
> > b) have at most one slot per CPU the machine hardware can in principle
> > support.
> > 
> > 								Honza
> 
> You're right, we should consider the number of block groups when setting
> the number of global goals.
> 
> However, a server's rootfs can often be quite small, perhaps only tens of
> GBs, while having many CPUs. In such cases, sbi->s_groups_count / 4 might
> still limit the filesystem's scalability.

I would not expect such a root filesystem to be loaded by many big
allocations in parallel :). And with a 4k blocksize, a 32GB filesystem would
already have 64 goals, which doesn't seem *that* limiting?

Also note that as the filesystem is filling up and the free space is getting
fragmented, the number of groups where large allocation can succeed will
reduce. Thus regardless of how many slots for streaming goal you have, they
will all end up pointing only to those several groups where large
allocation still succeeds. So although a large number of slots looks good for
an empty filesystem, the benefit for an aged filesystem diminishes, and a
larger number of slots will make the fs fragment faster.

> Furthermore, after supporting LBS, the number of block groups will
> sharply decrease.

Right. This is going to reduce scalability of block allocation in general.
Also, as the groups grow larger with a larger blocksize, the benefit of
streaming allocation, which just gives a hint about which block group to
use, is going to diminish when the free block search always starts from
0. We may need to store an ext4_fsblk_t (effectively combining
group+offset in a single atomic unit) as the streaming goal to mitigate this.
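
For illustration only (an untested sketch, where s_mb_last_blocks just
stands in for an ext4_fsblk_t variant of the per-goal array this patch
adds), the store and load sides would look roughly like:

        /* Store side, ext4_mb_use_best_found(): pack group + start block. */
        ext4_fsblk_t goal;

        goal = ext4_group_first_block_no(ac->ac_sb, ac->ac_f_ex.fe_group) +
               EXT4_C2B(sbi, ac->ac_f_ex.fe_start);
        /* Caveat of the sketch: a 64-bit store is not atomic on 32-bit. */
        smp_store_release(&sbi->s_mb_last_blocks[hash], goal);

        /* Load side, ext4_mb_regular_allocator(): unpack into the goal. */
        ext4_fsblk_t last;
        ext4_grpblk_t blkoff;

        last = smp_load_acquire(&sbi->s_mb_last_blocks[hash]);
        ext4_get_group_no_and_offset(ac->ac_sb, last,
                                     &ac->ac_g_ex.fe_group, &blkoff);
        ac->ac_g_ex.fe_start = EXT4_B2C(sbi, blkoff);

The group scan would then have a starting offset available, so it could
(with matching changes in the scanning code) resume near where the last
big allocation landed instead of always from block 0.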

> How about we directly use sbi->s_groups_count (which would effectively be
> min(num_possible_cpus(), sbi->s_groups_count)) instead? This would also
> avoid zero values.

Avoiding zero values is definitely a good point. My concern is that if we
have sb->s_groups_count streaming goals, then practically each group will
become a streaming goal group and thus we can just remove the streaming
allocation altogether, there's no benefit.

We could make streaming goal to be ext4_fsblk_t so that also offset of the
last big allocation in the group is recorded as I wrote above. That would
tend to pack big allocations in each group together which is beneficial to
combat fragmentation even with higher proportion of groups that are streaming
goals (and likely becomes more important as the blocksize and thus group
size grow). We can discuss proper number of slots for streaming allocation
(I'm not hung up on it being quarter of the group count) but I'm convinced
sb->s_groups_count is too much :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Baokun Li 3 months, 1 week ago
On 2025/6/30 16:38, Jan Kara wrote:
> On Mon 30-06-25 14:50:30, Baokun Li wrote:
>> On 2025/6/28 2:31, Jan Kara wrote:
>>> On Mon 23-06-25 15:32:52, Baokun Li wrote:
>>>> When allocating data blocks, if the first try (goal allocation) fails and
>>>> stream allocation is on, it tries a global goal starting from the last
>>>> group we used (s_mb_last_group). This helps cluster large files together
>>>> to reduce free space fragmentation, and the data block contiguity also
>>>> accelerates write-back to disk.
>>>>
>>>> However, when multiple processes allocate blocks, having just one global
>>>> goal means they all fight over the same group. This drastically lowers
>>>> the chances of extents merging and leads to much worse file fragmentation.
>>>>
>>>> To mitigate this multi-process contention, we now employ multiple global
>>>> goals, with the number of goals being the CPU count rounded up to the
>>>> nearest power of 2. To ensure a consistent goal for each inode, we select
>>>> the corresponding goal by taking the inode number modulo the total number
>>>> of goals.
>>>>
>>>> Performance test data follows:
>>>>
>>>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>>>> Observation: Average fallocate operations per container per second.
>>>>
>>>>                      | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>>>    Disk: 960GB SSD   |-------------------------|-------------------------|
>>>>                      | base  |    patched      | base  |    patched      |
>>>> -------------------|-------|-----------------|-------|-----------------|
>>>> mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
>>>> mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
>>>>
>>>> Signed-off-by: Baokun Li <libaokun1@huawei.com>
>>> ...
>>>
>>>> +/*
>>>> + * Number of mb last groups
>>>> + */
>>>> +#ifdef CONFIG_SMP
>>>> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
>>>> +#else
>>>> +#define MB_LAST_GROUPS 1
>>>> +#endif
>>>> +
>>> I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
>>> distribution kernels (it is just a theoretical maximum for the number of
>>> CPUs the kernel can support)
>> nr_cpu_ids is generally equal to num_possible_cpus(). Only when
>> CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
>> which represents the maximum number of supported CPUs.
> Indeed, CONFIG_FORCE_NR_CPUS confused me.
>
>>> which seems like far too much for small
>>> filesystems with say 100 block groups.
>> It does make sense.
>>
>>> I'd rather pick the array size like:
>>>
>>> min(num_possible_cpus(), sbi->s_groups_count/4)
>>>
>>> to
>>>
>>> a) don't have too many slots so we still concentrate big allocations in
>>> somewhat limited area of the filesystem (a quarter of block groups here).
>>>
>>> b) have at most one slot per CPU the machine hardware can in principle
>>> support.
>>>
>>> 								Honza
>> You're right, we should consider the number of block groups when setting
>> the number of global goals.
>>
>> However, a server's rootfs can often be quite small, perhaps only tens of
>> GBs, while having many CPUs. In such cases, sbi->s_groups_count / 4 might
>> still limit the filesystem's scalability.
> I would not expect such root filesystem to be loaded by many big
> allocations in parallel :). And with 4k blocksize 32GB filesystem would
> have already 64 goals which doesn't seem *that* limiting?

Docker's default path is on the rootfs. Our rootfs size is typically 70GB,
but we might have 300+ or even 500+ CPUs. This could lead to scalability
issues in certain specific scenarios. However, in general,
sbi->s_groups_count / 4 does appear to be sufficient.

> Also note that as the filesystem is filling up and the free space is getting
> fragmented, the number of groups where large allocation can succeed will
> reduce. Thus regardless of how many slots for streaming goal you have, they
> will all end up pointing only to those several groups where large
> allocation still succeeds. So although large number of slots looks good for
> an empty filesystem, the benefit for aged filesystem is diminishing and
> larger number of slots will make the fs fragment faster.
I don't think so. Although we're now splitting into multiple goals, these
goals all start from group zero, so 'n' goals simply cause the groups to be
scanned up to 'n' times. We end up re-searching the disk for free space
rather than creating more fragmentation.

This approach can actually solve the issue where, with a single goal, a
group that still has 4K of free space is skipped by an 8K allocation
request, so subsequent 4K allocation requests never come back to it and
instead split larger free extents.
>
>> Furthermore, after supporting LBS, the number of block groups will
>> sharply decrease.
> Right. This is going to reduce scalability of block allocation in general.
> Also as the groups grow larger with larger blocksize the benefit of
> streaming allocation which just gives a hint about block group to use is
> going to diminish when the free block search will be always starting from
> 0. We will maybe need to store ext4_fsblk_t (effectively combining
> group+offset in a single atomic unit) as a streaming goal to mitigate this.
I don't think that's necessary. We still need to consider block group lock
contention, so the smallest unit should always be the group.
>
>> How about we directly use sbi->s_groups_count (which would effectively be
>> min(num_possible_cpus(), sbi->s_groups_count)) instead? This would also
>> avoid zero values.
> Avoiding zero values is definitely a good point. My concern is that if we
> have sb->s_groups_count streaming goals, then practically each group will
> become a streaming goal group and thus we can just remove the streaming
> allocation altogether, there's no benefit.
Having 'n' goals simply means we scan the groups up to 'n' times; it's not
tied to the number of groups. However, when there are too many goals, the
probability that different inodes hash to the same goal, and thus contend,
increases. Nevertheless, this is always better than a single goal, where
all allocations would contend for the same group.

Now that we hash based on the inode number, we could later also choose the
inode number based on the CPU ID during inode allocation.
>
> We could make streaming goal to be ext4_fsblk_t so that also offset of the
> last big allocation in the group is recorded as I wrote above. That would
> tend to pack big allocations in each group together which is beneficial to
> combat fragmentation even with higher proportion of groups that are streaming
> goals (and likely becomes more important as the blocksize and thus group
> size grow). We can discuss proper number of slots for streaming allocation
> (I'm not hung up on it being quarter of the group count) but I'm convinced
> sb->s_groups_count is too much :)
>
> 								Honza

I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
believe recording offsets is necessary. As groups become larger,
contention for groups will intensify, and adding offsets would only
make this contention worse.


Regards,
Baokun

Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Jan Kara 3 months, 1 week ago
On Mon 30-06-25 18:02:49, Baokun Li wrote:
> On 2025/6/30 16:38, Jan Kara wrote:
> > We could make streaming goal to be ext4_fsblk_t so that also offset of the
> > last big allocation in the group is recorded as I wrote above. That would
> > tend to pack big allocations in each group together which is beneficial to
> > combat fragmentation even with higher proportion of groups that are streaming
> > goals (and likely becomes more important as the blocksize and thus group
> > size grow). We can discuss proper number of slots for streaming allocation
> > (I'm not hung up on it being quarter of the group count) but I'm convinced
> > sb->s_groups_count is too much :)
> > 
> > 								Honza
> 
> I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
> believe recording offsets is necessary. As groups become larger,
> contention for groups will intensify, and adding offsets would only
> make this contention worse.

I agree the contention for groups will increase when the group count goes
down. I just thought offsets may help to find free space faster in large
groups (and thus reduce contention) and also reduce free space
fragmentation within a group (by having higher chances of placing large
allocations close together within a group) but maybe that's not the case.
Offsets are definitely not a requirement at this point.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Baokun Li 3 months, 1 week ago
On 2025/7/1 1:41, Jan Kara wrote:
> On Mon 30-06-25 18:02:49, Baokun Li wrote:
>> On 2025/6/30 16:38, Jan Kara wrote:
>>> We could make streaming goal to be ext4_fsblk_t so that also offset of the
>>> last big allocation in the group is recorded as I wrote above. That would
>>> tend to pack big allocations in each group together which is beneficial to
>>> combat fragmentation even with higher proportion of groups that are streaming
>>> goals (and likely becomes more important as the blocksize and thus group
>>> size grow). We can discuss proper number of slots for streaming allocation
>>> (I'm not hung up on it being quarter of the group count) but I'm convinced
>>> sb->s_groups_count is too much :)
>>>
>>> 								Honza
>> I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
>> believe recording offsets is necessary. As groups become larger,
>> contention for groups will intensify, and adding offsets would only
>> make this contention worse.
> I agree the contention for groups will increase when the group count goes
> down. I just thought offsets may help to find free space faster in large
> groups (and thus reduce contention) and also reduce free space
> fragmentation within a group (by having higher chances of placing large
> allocations close together within a group) but maybe that's not the case.
> Offsets are definitely not requirement at this point.
>
> 								Honza
>
Thinking this over, with LBS support coming, if our block size jumps from
4KB to 64KB, the maximum group size will dramatically increase from 128MB
to 32GB (even with the current 4GB group limit). If free space within a
group gets heavily fragmented, iterating through that single group could
become quite time-consuming.

Your idea of recording offsets to prevent redundant scanning of
already-checked extents within a group definitely makes sense. But with
reference to the idea of optimizing linear traversal of groups, I think it
might be better to record the offset of the first occurrence of each order
in the same way that bb_counters records the number of each order.
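
To sketch the idea (untested; bb_first_off[] is only an illustrative name
for a per-order start offset stored next to the existing bb_counters[],
and wanted_order, grp, e4b and ex are the usual context of a group scan):

        /*
         * bb_counters[order] already tells us how many free extents of a
         * given order the group has; bb_first_off[order] would additionally
         * remember where the first one starts, so the scan can jump straight
         * to it instead of walking a (potentially 32GB) group from block 0.
         */
        for (order = wanted_order; order < MB_NUM_ORDERS(ac->ac_sb); order++) {
                if (grp->bb_counters[order] == 0)
                        continue;       /* no free extent of this order */
                mb_find_extent(e4b, grp->bb_first_off[order],
                               ac->ac_g_ex.fe_len, &ex);
                if (ex.fe_len >= ac->ac_g_ex.fe_len)
                        break;
        }

Keeping bb_first_off[] accurate would of course need updates in the same
places that already maintain bb_counters[], i.e. mb_mark_used() and
mb_free_blocks().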


Cheers,
Baokun
Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Jan Kara 3 months, 1 week ago
On Tue 01-07-25 11:32:23, Baokun Li wrote:
> On 2025/7/1 1:41, Jan Kara wrote:
> > On Mon 30-06-25 18:02:49, Baokun Li wrote:
> > > On 2025/6/30 16:38, Jan Kara wrote:
> > > > We could make streaming goal to be ext4_fsblk_t so that also offset of the
> > > > last big allocation in the group is recorded as I wrote above. That would
> > > > tend to pack big allocations in each group together which is beneficial to
> > > > combat fragmentation even with higher proportion of groups that are streaming
> > > > goals (and likely becomes more important as the blocksize and thus group
> > > > size grow). We can discuss proper number of slots for streaming allocation
> > > > (I'm not hung up on it being quarter of the group count) but I'm convinced
> > > > sb->s_groups_count is too much :)
> > > > 
> > > > 								Honza
> > > I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
> > > believe recording offsets is necessary. As groups become larger,
> > > contention for groups will intensify, and adding offsets would only
> > > make this contention worse.
> > I agree the contention for groups will increase when the group count goes
> > down. I just thought offsets may help to find free space faster in large
> > groups (and thus reduce contention) and also reduce free space
> > fragmentation within a group (by having higher chances of placing large
> > allocations close together within a group) but maybe that's not the case.
> > Offsets are definitely not requirement at this point.
> > 
> > 								Honza
> > 
> Thinking this over, with LBS support coming, if our block size jumps from
> 4KB to 64KB, the maximum group size will dramatically increase from 128MB
> to 32GB (even with the current 4GB group limit). If free space within a
> group gets heavily fragmented, iterating through that single group could
> become quite time-consuming.
> 
> Your idea of recording offsets to prevent redundant scanning of
> already-checked extents within a group definitely makes sense. But with
> reference to the idea of optimizing linear traversal of groups, I think it
> might be better to record the offset of the first occurrence of each order
> in the same way that bb_counters records the number of each order.

Yes, something like that makes sense. But I guess that's material for the
next patch set :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce contention
Posted by Baokun Li 3 months, 1 week ago
On 2025/7/1 19:53, Jan Kara wrote:
> On Tue 01-07-25 11:32:23, Baokun Li wrote:
>> On 2025/7/1 1:41, Jan Kara wrote:
>>> On Mon 30-06-25 18:02:49, Baokun Li wrote:
>>>> On 2025/6/30 16:38, Jan Kara wrote:
>>>>> We could make streaming goal to be ext4_fsblk_t so that also offset of the
>>>>> last big allocation in the group is recorded as I wrote above. That would
>>>>> tend to pack big allocations in each group together which is benefitial to
>>>>> combat fragmentation even with higher proportion of groups that are streaming
>>>>> goals (and likely becomes more important as the blocksize and thus group
>>>>> size grow). We can discuss proper number of slots for streaming allocation
>>>>> (I'm not hung up on it being quarter of the group count) but I'm convinced
>>>>> sb->s_groups_count is too much :)
>>>>>
>>>>> 								Honza
>>>> I think sbi->s_groups_count / 4 is indeed acceptable. However, I don't
>>>> believe recording offsets is necessary. As groups become larger,
>>>> contention for groups will intensify, and adding offsets would only
>>>> make this contention worse.
>>> I agree the contention for groups will increase when the group count goes
>>> down. I just thought offsets may help to find free space faster in large
>>> groups (and thus reduce contention) and also reduce free space
>>> fragmentation within a group (by having higher chances of placing large
>>> allocations close together within a group) but maybe that's not the case.
>>> Offsets are definitely not requirement at this point.
>>>
>>> 								Honza
>>>
>> Thinking this over, with LBS support coming, if our block size jumps from
>> 4KB to 64KB, the maximum group size will dramatically increase from 128MB
>> to 32GB (even with the current 4GB group limit). If free space within a
>> group gets heavily fragmented, iterating through that single group could
>> become quite time-consuming.
>>
>> Your idea of recording offsets to prevent redundant scanning of
>> already-checked extents within a group definitely makes sense. But with
>> reference to the idea of optimizing linear traversal of groups, I think it
>> might be better to record the offset of the first occurrence of each order
>> in the same way that bb_counters records the number of each order.
> Yes, something like that makes sense. But I guess that's a material for the
> next patch set :)
>
> 								Honza

Yes, this isn't urgent right now. I plan to implement this idea after
the LBS patch set is complete.

Thank you very much for your review and patient explanations! 😀


Regards,
Baokun