[PATCH v2 3/3] btrfs: fix check_chunk_block_group_mappings() to actually iterate all chunks

Posted by ZhengYuan Huang 3 weeks, 2 days ago
[BUG]
A corrupted image with a chunk present in the chunk tree but whose
corresponding block group item is missing from the extent tree can be
mounted successfully, even though check_chunk_block_group_mappings()
is supposed to catch exactly this corruption at mount time.  Once
mounted, running btrfs balance with a usage filter (-dusage=N or
-dusage=min..max) triggers a null-ptr-deref:

  KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
    RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline]
    RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline]
    RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline]
    RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604

The crash occurs because __btrfs_balance() iterates the on-disk chunk
tree, finds the orphaned chunk, calls chunk_usage_filter() (or
chunk_usage_range_filter()), which queries the in-memory block group
cache via btrfs_lookup_block_group().  Since no block group was ever
inserted for this chunk, the lookup returns NULL, and the subsequent
dereference of cache->used crashes.

[CAUSE]
check_chunk_block_group_mappings() uses btrfs_find_chunk_map() to
iterate the in-memory chunk map (fs_info->mapping_tree):

  map = btrfs_find_chunk_map(fs_info, start, 1);

With @start = 0 and @length = 1, btrfs_find_chunk_map() looks for a
chunk map that *contains* the logical address 0. If no chunk contains
logical address 0, btrfs_find_chunk_map(fs_info, 0, 1) returns NULL
immediately and the loop breaks in the very first iteration, having
checked zero chunks, so the verification is a no-op and the corrupted
image passes the mount-time check undetected. Even when some chunk does
map logical address 0, the walk stops at the first gap in the logical
address space, leaving every chunk beyond that gap unchecked.

[FIX]
Replace the btrfs_find_chunk_map() based loop with a direct in-order
walk of fs_info->mapping_tree using rb_first_cached() + rb_next(),
protected by mapping_tree_lock. This guarantees that every chunk map
in the tree is visited regardless of the logical addresses involved.
Since the mapping_tree itself is accessed under read_lock, no refcount
manipulation of each map entry is needed inside the loop, so the
btrfs_free_chunk_map() calls on the map are also removed.

Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
---
 fs/btrfs/block-group.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5322ef2ae015..25bd0d058be6 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2319,29 +2319,22 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
  */
 static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 {
-	u64 start = 0;
+	struct rb_node *node;
 	int ret = 0;
 
-	while (1) {
+	read_lock(&fs_info->mapping_tree_lock);
+	for (node = rb_first_cached(&fs_info->mapping_tree); node;
+	     node = rb_next(node)) {
 		struct btrfs_chunk_map *map;
 		struct btrfs_block_group *bg;
 
-		/*
-		 * btrfs_find_chunk_map() will return the first chunk map
-		 * intersecting the range, so setting @length to 1 is enough to
-		 * get the first chunk.
-		 */
-		map = btrfs_find_chunk_map(fs_info, start, 1);
-		if (!map)
-			break;
-
+		map = rb_entry(node, struct btrfs_chunk_map, rb_node);
 		bg = btrfs_lookup_block_group(fs_info, map->start);
 		if (unlikely(!bg)) {
 			btrfs_err(fs_info,
 	"chunk start=%llu len=%llu doesn't have corresponding block group",
 				     map->start, map->chunk_len);
 			ret = -EUCLEAN;
-			btrfs_free_chunk_map(map);
 			break;
 		}
 		if (unlikely(bg->start != map->start || bg->length != map->chunk_len ||
@@ -2354,14 +2347,12 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
 				bg->start, bg->length,
 				bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK);
 			ret = -EUCLEAN;
-			btrfs_free_chunk_map(map);
 			btrfs_put_block_group(bg);
 			break;
 		}
-		start = map->start + map->chunk_len;
-		btrfs_free_chunk_map(map);
 		btrfs_put_block_group(bg);
 	}
+	read_unlock(&fs_info->mapping_tree_lock);
 	return ret;
 }
 
-- 
2.43.0
Re: [PATCH v2 3/3] btrfs: fix check_chunk_block_group_mappings() to actually iterate all chunks
Posted by David Sterba 2 weeks ago
On Sat, Mar 14, 2026 at 08:37:41PM +0800, ZhengYuan Huang wrote:
> [BUG]
> A corrupted image with a chunk present in the chunk tree but whose
> corresponding block group item is missing from the extent tree can be
> mounted successfully, even though check_chunk_block_group_mappings()
> is supposed to catch exactly this corruption at mount time.  Once
> mounted, running btrfs balance with a usage filter (-dusage=N or
> -dusage=min..max) triggers a null-ptr-deref:
> 
>   KASAN: null-ptr-deref in range [0x0000000000000070-0x0000000000000077]
>     RIP: 0010:chunk_usage_filter fs/btrfs/volumes.c:3874 [inline]
>     RIP: 0010:should_balance_chunk fs/btrfs/volumes.c:4018 [inline]
>     RIP: 0010:__btrfs_balance fs/btrfs/volumes.c:4172 [inline]
>     RIP: 0010:btrfs_balance+0x2024/0x42b0 fs/btrfs/volumes.c:4604
> 
> The crash occurs because __btrfs_balance() iterates the on-disk chunk
> tree, finds the orphaned chunk, calls chunk_usage_filter() (or
> chunk_usage_range_filter()), which queries the in-memory block group
> cache via btrfs_lookup_block_group().  Since no block group was ever
> inserted for this chunk, the lookup returns NULL, and the subsequent
> dereference of cache->used crashes.
> 
> [CAUSE]
> check_chunk_block_group_mappings() uses btrfs_find_chunk_map() to
> iterate the in-memory chunk map (fs_info->mapping_tree):
> 
>   map = btrfs_find_chunk_map(fs_info, start, 1);
> 
> With @start = 0 and @length = 1, btrfs_find_chunk_map() looks for a
> chunk map that *contains* the logical address 0. If no chunk contains
> logical address 0, btrfs_find_chunk_map(fs_info, 0, 1) returns NULL
> immediately and the loop breaks after the very first iteration,
> having checked zero chunks. The entire verification function is therefore
> a no-op, and the corrupted image passes the mount-time check undetected.
> 
> [FIX]
> Replace the btrfs_find_chunk_map() based loop with a direct in-order
> walk of fs_info->mapping_tree using rb_first_cached() + rb_next(),
> protected by mapping_tree_lock. This guarantees that every chunk map
> in the tree is visited regardless of the logical addresses involved.
> Since the mapping_tree itself is accessed under read_lock, no refcount
> manipulation of each map entry is needed inside the loop, so the
> btrfs_free_chunk_map() calls on the map are also removed.
> 
> Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
> ---
>  fs/btrfs/block-group.c | 21 ++++++---------------
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 5322ef2ae015..25bd0d058be6 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -2319,29 +2319,22 @@ static struct btrfs_block_group *btrfs_create_block_group_cache(
>   */
>  static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
>  {
> -	u64 start = 0;
> +	struct rb_node *node;
>  	int ret = 0;
>  
> -	while (1) {
> +	read_lock(&fs_info->mapping_tree_lock);

This is called during mount indirectly from open_ctree() and this is
single threaded (partially), so the lock may not be needed. It would be
needed if there is, e.g., a caching thread possibly accessing the same
structures, I haven't looked closely.

> +	for (node = rb_first_cached(&fs_info->mapping_tree); node;
> +	     node = rb_next(node)) {
>  		struct btrfs_chunk_map *map;
>  		struct btrfs_block_group *bg;
>  
> -		/*
> -		 * btrfs_find_chunk_map() will return the first chunk map
> -		 * intersecting the range, so setting @length to 1 is enough to
> -		 * get the first chunk.
> -		 */
> -		map = btrfs_find_chunk_map(fs_info, start, 1);
> -		if (!map)
> -			break;
> -
> +		map = rb_entry(node, struct btrfs_chunk_map, rb_node);
>  		bg = btrfs_lookup_block_group(fs_info, map->start);

What concerns me is this lookup. Previously the references avoided
taking the big lock. The time the lock is held may add up significantly
for all block groups but as said before it might not be necessary due to
the mount context.

>  		if (unlikely(!bg)) {
>  			btrfs_err(fs_info,
>  	"chunk start=%llu len=%llu doesn't have corresponding block group",
>  				     map->start, map->chunk_len);
>  			ret = -EUCLEAN;
> -			btrfs_free_chunk_map(map);
>  			break;
>  		}
>  		if (unlikely(bg->start != map->start || bg->length != map->chunk_len ||
> @@ -2354,14 +2347,12 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info)
>  				bg->start, bg->length,
>  				bg->flags & BTRFS_BLOCK_GROUP_TYPE_MASK);
>  			ret = -EUCLEAN;
> -			btrfs_free_chunk_map(map);
>  			btrfs_put_block_group(bg);
>  			break;
>  		}
> -		start = map->start + map->chunk_len;
> -		btrfs_free_chunk_map(map);
>  		btrfs_put_block_group(bg);
>  	}
> +	read_unlock(&fs_info->mapping_tree_lock);
>  	return ret;
>  }
>  
> -- 
> 2.43.0
>
Re: [PATCH v2 3/3] btrfs: fix check_chunk_block_group_mappings() to actually iterate all chunks
Posted by ZhengYuan Huang 2 weeks ago
On Tue, Mar 24, 2026 at 1:52 AM David Sterba <dsterba@suse.cz> wrote:
> This is called during mount indirectly from open_ctree() and this is
> single threaded (partially), so the lock may not be needed. It would be
> needed if there is, e.g., a caching thread possibly accessing the same
> structures, I haven't looked closely.
>
> > +     for (node = rb_first_cached(&fs_info->mapping_tree); node;
> > +          node = rb_next(node)) {
> >               struct btrfs_chunk_map *map;
> >               struct btrfs_block_group *bg;
> >
> > -             /*
> > -              * btrfs_find_chunk_map() will return the first chunk map
> > -              * intersecting the range, so setting @length to 1 is enough to
> > -              * get the first chunk.
> > -              */
> > -             map = btrfs_find_chunk_map(fs_info, start, 1);
> > -             if (!map)
> > -                     break;
> > -
> > +             map = rb_entry(node, struct btrfs_chunk_map, rb_node);
> >               bg = btrfs_lookup_block_group(fs_info, map->start);
>
> What concerns me is this lookup. Previously the references avoided
> taking the big lock. The time the lock is held may add up significantly
> for all block groups but as said before it might not be necessary due to
> the mount context.

Thanks for the suggestion, I’ll take a closer look at the locking here.
If the lock turns out to be unnecessary in this context, I’ll drop it
and include the change in v3.

Thanks,
ZhengYuan Huang