fs/btrfs/block-group.c | 6 ++++++ 1 file changed, 6 insertions(+)
Differential analysis of block-group.c shows an inconsistency between
btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes().
When space is reserved, btrfs_use_block_group_size_class() is called to
set a block group's size class, specializing it for a specific allocation
size to reduce fragmentation. However, when these reservations are
subsequently freed (e.g., due to an error or transaction abort),
btrfs_free_reserved_bytes() fails to perform the corresponding cleanup.
This leads to a state where a block group remains stuck with a specific
size class even if it contains no used or reserved bytes. While the
impact depends on the workload, this stale state can cause
find_free_extent to unnecessarily skip these block groups for mismatched
size requests.
In some cases, it may even trigger the allocation of new block groups if
no matching or unassigned block groups are available.
Fix this by resetting the size class to BTRFS_BG_SZ_NONE in
btrfs_free_reserved_bytes() when the block group becomes completely
empty.
Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
---
Changelog:
v2 -> v3:
1. Corrected the "Fixes" tag to 52bb7a2166af.
2. Updated the commit message to reflect that the performance impact is workload-dependent.
3. Added mention that the issue can lead to unnecessary allocation of new block groups.
v1 -> v2:
1. Inlined btrfs_maybe_reset_size_class() function.
2. Moved check below the reserved bytes decrement in btrfs_free_reserved_bytes().
---
fs/btrfs/block-group.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 08b14449fabe..8339ad001d3f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3867,6 +3867,12 @@ void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, u64 num_bytes,
spin_lock(&cache->lock);
bg_ro = cache->ro;
cache->reserved -= num_bytes;
+
+ if (btrfs_block_group_should_use_size_class(cache)) {
+ if (cache->used == 0 && cache->reserved == 0)
+ cache->size_class = BTRFS_BG_SZ_NONE;
+ }
+
if (is_delalloc)
cache->delalloc_bytes -= num_bytes;
spin_unlock(&cache->lock);
--
2.25.1
On Mon, Jan 12, 2026 at 03:02:48PM +0000, Jiasheng Jiang wrote:
> Differential analysis of block-group.c shows an inconsistency between
> btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes().
>
> When space is reserved, btrfs_use_block_group_size_class() is called to
> set a block group's size class, specializing it for a specific allocation
> size to reduce fragmentation. However, when these reservations are
> subsequently freed (e.g., due to an error or transaction abort),
> btrfs_free_reserved_bytes() fails to perform the corresponding cleanup.
>
> This leads to a state where a block group remains stuck with a specific
> size class even if it contains no used or reserved bytes. While the
> impact depends on the workload, this stale state can cause
> find_free_extent to unnecessarily skip these block groups for mismatched
> size requests.
> In some cases, it may even trigger the allocation of new block groups if
> no matching or unassigned block groups are available.
Sorry I missed v1 and v2 of this, so I may be late with my questions.
This seems reasonable enough, but I am wondering why you are only doing
it for the error handling path.
We can also legitimately free the last extent in a block group which will
go through btrfs_update_block_group(). That will either mark it unused or
for async discard, which may eventually result in the bg getting cleaned up.
But in the meantime it is eligible to be used again so by the same
reasoning as this patch, it shouldn't have a size class.
That aside, I don't think this should be a huge issue for two reasons:
1. size classes are advisory and allocations won't fail because of them,
so I don't think your assertion about triggering new allocations is
accurate. (see LOOP_WRONG_SIZE_CLASS)
2. the bg is likely to end up getting used anyway, so it would take
pretty specific conditions to meaningfully "leak" one.
If this issue is something you have observed in practice rather than a
(valid) theoretical problem, I'd be curious to hear more about your
workload and the negative effects you observed.
One final point: this also makes me wonder if it would be beneficial to
merge some of the logic on hitting 0 used+reserved between
btrfs_update_block_group() and btrfs_free_reserve_bytes(). I imagine we
might be able to also miss a block group that needs to be marked unused
on this path.
Thanks,
Boris
>
> Fix this by resetting the size class to BTRFS_BG_SZ_NONE in
> btrfs_free_reserved_bytes() when the block group becomes completely
> empty.
>
> Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator")
> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
> ---
> Changelog:
>
> v2 -> v3:
> 1. Corrected the "Fixes" tag to 52bb7a2166af.
> 2. Updated the commit message to reflect that the performance impact is workload-dependent.
> 3. Added mention that the issue can lead to unnecessary allocation of new block groups.
>
> v1 -> v2:
> 1. Inlined btrfs_maybe_reset_size_class() function.
> 2. Moved check below the reserved bytes decrement in btrfs_free_reserved_bytes().
> ---
> fs/btrfs/block-group.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 08b14449fabe..8339ad001d3f 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -3867,6 +3867,12 @@ void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, u64 num_bytes,
> spin_lock(&cache->lock);
> bg_ro = cache->ro;
> cache->reserved -= num_bytes;
> +
> + if (btrfs_block_group_should_use_size_class(cache)) {
> + if (cache->used == 0 && cache->reserved == 0)
> + cache->size_class = BTRFS_BG_SZ_NONE;
> + }
> +
> if (is_delalloc)
> cache->delalloc_bytes -= num_bytes;
> spin_unlock(&cache->lock);
> --
> 2.25.1
>
Differential analysis of block-group.c shows an inconsistency in how
block group size classes are managed.
Currently, btrfs_use_block_group_size_class() sets a block group's size
class to specialize it for a specific allocation size. However, this
size class remains "stale" even if the block group becomes completely
empty (both used and reserved bytes reach zero).
This happens in two scenarios:
1. When space reservations are freed (e.g., due to errors or transaction
aborts) via btrfs_free_reserved_bytes().
2. When the last extent in a block group is freed via
btrfs_update_block_group().
While size classes are advisory, a stale size class can cause
find_free_extent to unnecessarily skip candidate block groups during
initial search loops. This undermines the purpose of size classes—to
reduce fragmentation—by keeping block groups restricted to a specific
size class when they could be reused for any size.
Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a
block group's used and reserved counts both reach zero. This ensures
that empty block groups are fully available for any allocation size in
the next cycle.
Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
---
Changelog:
v3 -> v4:
1. Introduced btrfs_maybe_reset_size_class() helper to unify the logic.
2. Expanded the fix to include btrfs_update_block_group() to handle cases where the last extent in a block group is freed.
3. Refined the commit message to clarify that size classes are advisory and their stale state impacts allocation efficiency rather than causing absolute allocation failures.
v2 -> v3:
1. Corrected the "Fixes" tag to 52bb7a2166af.
2. Updated the commit message to reflect that the performance impact is workload-dependent.
3. Added mention that the issue can lead to unnecessary allocation of new block groups.
v1 -> v2:
1. Inlined btrfs_maybe_reset_size_class() function.
2. Moved check below the reserved bytes decrement in btrfs_free_reserved_bytes().
---
fs/btrfs/block-group.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 08b14449fabe..343d7724939f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3675,6 +3675,14 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans)
return ret;
}
+static void btrfs_maybe_reset_size_class(struct btrfs_block_group *cache)
+{
+ lockdep_assert_held(&cache->lock);
+ if (btrfs_block_group_should_use_size_class(cache) &&
+ cache->used == 0 && cache->reserved == 0)
+ cache->size_class = BTRFS_BG_SZ_NONE;
+}
+
int btrfs_update_block_group(struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes, bool alloc)
{
@@ -3739,6 +3747,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
old_val -= num_bytes;
cache->used = old_val;
cache->pinned += num_bytes;
+ btrfs_maybe_reset_size_class(cache);
btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
space_info->bytes_used -= num_bytes;
space_info->disk_used -= num_bytes * factor;
@@ -3867,6 +3876,7 @@ void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, u64 num_bytes,
spin_lock(&cache->lock);
bg_ro = cache->ro;
cache->reserved -= num_bytes;
+ btrfs_maybe_reset_size_class(cache);
if (is_delalloc)
cache->delalloc_bytes -= num_bytes;
spin_unlock(&cache->lock);
--
2.25.1
On Tue, Jan 13, 2026 at 08:55:32PM +0000, Jiasheng Jiang wrote:
> Differential analysis of block-group.c shows an inconsistency in how
> block group size classes are managed.
>
> Currently, btrfs_use_block_group_size_class() sets a block group's size
> class to specialize it for a specific allocation size. However, this
> size class remains "stale" even if the block group becomes completely
> empty (both used and reserved bytes reach zero).
>
> This happens in two scenarios:
> 1. When space reservations are freed (e.g., due to errors or transaction
> aborts) via btrfs_free_reserved_bytes().
> 2. When the last extent in a block group is freed via
> btrfs_update_block_group().
>
> While size classes are advisory, a stale size class can cause
> find_free_extent to unnecessarily skip candidate block groups during
> initial search loops. This undermines the purpose of size classes—to
> reduce fragmentation—by keeping block groups restricted to a specific
> size class when they could be reused for any size.
>
> Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a
> block group's used and reserved counts both reach zero. This ensures
> that empty block groups are fully available for any allocation size in
> the next cycle.
>
> Fixes: 52bb7a2166af ("btrfs: introduce size class to block group allocator")
I don't think this necessarily amounts to a bug of the level that needs
a Fixes tag, since I don't think it should realistically break any
workload nor is it that urgent to backport to stable branches.
But if we want to use these tags to note targeted improvements / semi-bugs
then I am fine leaving it.
Either way,
Reviewed-by: Boris Burkov <boris@bur.io>
> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
> ---
> Changelog:
>
> v3 -> v4:
> 1. Introduced btrfs_maybe_reset_size_class() helper to unify the logic.
> 2. Expanded the fix to include btrfs_update_block_group() to handle cases where the last extent in a block group is freed.
> 3. Refined the commit message to clarify that size classes are advisory and their stale state impacts allocation efficiency rather than causing absolute allocation failures.
>
> v2 -> v3:
> 1. Corrected the "Fixes" tag to 52bb7a2166af.
> 2. Updated the commit message to reflect that the performance impact is workload-dependent.
> 3. Added mention that the issue can lead to unnecessary allocation of new block groups.
>
> v1 -> v2:
> 1. Inlined btrfs_maybe_reset_size_class() function.
> 2. Moved check below the reserved bytes decrement in btrfs_free_reserved_bytes().
> ---
> fs/btrfs/block-group.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 08b14449fabe..343d7724939f 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -3675,6 +3675,14 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans)
> return ret;
> }
>
> +static void btrfs_maybe_reset_size_class(struct btrfs_block_group *cache)
> +{
> + lockdep_assert_held(&cache->lock);
> + if (btrfs_block_group_should_use_size_class(cache) &&
> + cache->used == 0 && cache->reserved == 0)
> + cache->size_class = BTRFS_BG_SZ_NONE;
> +}
> +
> int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> u64 bytenr, u64 num_bytes, bool alloc)
> {
> @@ -3739,6 +3747,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
> old_val -= num_bytes;
> cache->used = old_val;
> cache->pinned += num_bytes;
> + btrfs_maybe_reset_size_class(cache);
> btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
> space_info->bytes_used -= num_bytes;
> space_info->disk_used -= num_bytes * factor;
> @@ -3867,6 +3876,7 @@ void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, u64 num_bytes,
> spin_lock(&cache->lock);
> bg_ro = cache->ro;
> cache->reserved -= num_bytes;
> + btrfs_maybe_reset_size_class(cache);
> if (is_delalloc)
> cache->delalloc_bytes -= num_bytes;
> spin_unlock(&cache->lock);
> --
> 2.25.1
>
Differential analysis of block-group.c shows an inconsistency in how
block group size classes are managed.
Currently, btrfs_use_block_group_size_class() sets a block group's size
class to specialize it for a specific allocation size. However, this
size class remains "stale" even if the block group becomes completely
empty (both used and reserved bytes reach zero).
This happens in two scenarios:
1. When space reservations are freed (e.g., due to errors or transaction
aborts) via btrfs_free_reserved_bytes().
2. When the last extent in a block group is freed via
btrfs_update_block_group().
While size classes are advisory, a stale size class can cause
find_free_extent to unnecessarily skip candidate block groups during
initial search loops. This undermines the purpose of size classes—to
reduce fragmentation—by keeping block groups restricted to a specific
size class when they could be reused for any size.
Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a
block group's used and reserved counts both reach zero. This ensures
that empty block groups are fully available for any allocation size in
the next cycle.
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
---
Changelog:
v4 -> v5:
1. Remove the Fixes tag.
v3 -> v4:
1. Introduced btrfs_maybe_reset_size_class() helper to unify the logic.
2. Expanded the fix to include btrfs_update_block_group() to handle cases where the last extent in a block group is freed.
3. Refined the commit message to clarify that size classes are advisory and their stale state impacts allocation efficiency rather than causing absolute allocation failures.
v2 -> v3:
1. Corrected the "Fixes" tag to 52bb7a2166af.
2. Updated the commit message to reflect that the performance impact is workload-dependent.
3. Added mention that the issue can lead to unnecessary allocation of new block groups.
v1 -> v2:
1. Inlined btrfs_maybe_reset_size_class() function.
2. Moved check below the reserved bytes decrement in btrfs_free_reserved_bytes().
---
fs/btrfs/block-group.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 08b14449fabe..343d7724939f 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3675,6 +3675,14 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans)
return ret;
}
+static void btrfs_maybe_reset_size_class(struct btrfs_block_group *cache)
+{
+ lockdep_assert_held(&cache->lock);
+ if (btrfs_block_group_should_use_size_class(cache) &&
+ cache->used == 0 && cache->reserved == 0)
+ cache->size_class = BTRFS_BG_SZ_NONE;
+}
+
int btrfs_update_block_group(struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes, bool alloc)
{
@@ -3739,6 +3747,7 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans,
old_val -= num_bytes;
cache->used = old_val;
cache->pinned += num_bytes;
+ btrfs_maybe_reset_size_class(cache);
btrfs_space_info_update_bytes_pinned(space_info, num_bytes);
space_info->bytes_used -= num_bytes;
space_info->disk_used -= num_bytes * factor;
@@ -3867,6 +3876,7 @@ void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, u64 num_bytes,
spin_lock(&cache->lock);
bg_ro = cache->ro;
cache->reserved -= num_bytes;
+ btrfs_maybe_reset_size_class(cache);
if (is_delalloc)
cache->delalloc_bytes -= num_bytes;
spin_unlock(&cache->lock);
--
2.25.1
On Wed, Jan 14, 2026 at 01:13:38AM +0000, Jiasheng Jiang wrote: > Differential analysis of block-group.c shows an inconsistency in how > block group size classes are managed. > > Currently, btrfs_use_block_group_size_class() sets a block group's size > class to specialize it for a specific allocation size. However, this > size class remains "stale" even if the block group becomes completely > empty (both used and reserved bytes reach zero). > > This happens in two scenarios: > 1. When space reservations are freed (e.g., due to errors or transaction > aborts) via btrfs_free_reserved_bytes(). > 2. When the last extent in a block group is freed via > btrfs_update_block_group(). > > While size classes are advisory, a stale size class can cause > find_free_extent to unnecessarily skip candidate block groups during > initial search loops. This undermines the purpose of size classes—to > reduce fragmentation—by keeping block groups restricted to a specific > size class when they could be reused for any size. > > Fix this by resetting the size class to BTRFS_BG_SZ_NONE whenever a > block group's used and reserved counts both reach zero. This ensures > that empty block groups are fully available for any allocation size in > the next cycle. > > Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Added to for-next, thanks. > v4 -> v5: > 1. Remove the Fixes tag. I've added the tag back as I think it may make sense to backport it. > +static void btrfs_maybe_reset_size_class(struct btrfs_block_group *cache) Renamed 'cache' to 'bg' as the cache was from old code where it was referring to the in memory cache of block groups, while the object we care about is the block group.
© 2016 - 2026 Red Hat, Inc.