btrfs: disk-io: reject misaligned tree blocks in btree_csum_one_bio

[PATCH] btrfs: disk-io: reject misaligned tree blocks in btree_csum_one_bio

Posted by ZhengYuan Huang 1 week, 1 day ago

[BUG]
Running btrfs balance on a corrupt image can trigger a GPF, with KASAN
reporting a wild memory access:

  BTRFS warning: tree block not nodesize aligned, start 6179131392 nodesize 16384, can be resolved by a full metadata balance
  Oops: general protection fault, probably for non-canonical address 0xe0009d1000000052: 0000 [#1] SMP KASAN NOPTI
  KASAN: maybe wild-memory-access in range [0x0005088000000290-0x0005088000000297]
  Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
  RIP: 0010:get_unaligned_le64 include/linux/unaligned.h:28 [inline]
  RIP: 0010:btrfs_header_bytenr fs/btrfs/accessors.h:647 [inline]
  RIP: 0010:btree_csum_one_bio+0x175/0xfe0 fs/btrfs/disk-io.c:263
  Call Trace:
	<TASK>
	btrfs_bio_csum fs/btrfs/bio.c:511 [inline]
	btrfs_submit_chunk+0x138d/0x1750 fs/btrfs/bio.c:744
	btrfs_submit_bbio+0x20/0x40 fs/btrfs/bio.c:814
	write_one_eb+0x9ea/0xd30 fs/btrfs/extent_io.c:2239
	btree_write_cache_pages+0x836/0xdc0 fs/btrfs/extent_io.c:2342
	btree_writepages+0x163/0x1c0 fs/btrfs/disk-io.c:512
	do_writepages+0x255/0x5c0 mm/page-writeback.c:2604
	filemap_fdatawrite_wbc mm/filemap.c:389 [inline]
	filemap_fdatawrite_wbc+0xf2/0x150 mm/filemap.c:379
	__filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:422
	filemap_fdatawrite_range+0x2f/0x50 mm/filemap.c:440
	btrfs_write_marked_extents+0x13c/0x2d0 fs/btrfs/transaction.c:1157
	btrfs_write_and_wait_transaction+0xe5/0x250 fs/btrfs/transaction.c:1264
	btrfs_commit_transaction+0x28af/0x3d90 fs/btrfs/transaction.c:2533
	insert_balance_item.isra.0+0x392/0x3f0 fs/btrfs/volumes.c:3712
	btrfs_balance+0x1021/0x42b0 fs/btrfs/volumes.c:4582
	btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
	btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
	...

[CAUSE]
The corrupt image contains a tree block whose start address (6179131392)
is page-aligned (4 KiB) but NOT nodesize-aligned (16 KiB):

  6179131392 % 16384 == 4096
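
The arithmetic above can be double-checked in user space. The following is only a minimal sketch: is_aligned() and offset_in_node() are hypothetical helper names, not kernel functions. They use the same power-of-two mask that the kernel's IS_ALIGNED() relies on.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: same idea as the kernel's IS_ALIGNED().
 * Only valid when size is a power of two.
 */
static bool is_aligned(uint64_t value, uint64_t size)
{
	return (value & (size - 1)) == 0;
}

/*
 * Hypothetical helper: byte offset of a tree block start inside its
 * nodesize-sized slot (0 for a correctly aligned block).
 */
static uint64_t offset_in_node(uint64_t start, uint64_t nodesize)
{
	return start & (nodesize - 1);
}
```

For the reported block, is_aligned(6179131392, 4096) holds while is_aligned(6179131392, 16384) does not, and offset_in_node(6179131392, 16384) gives the 4096 shown above.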

When alloc_extent_buffer() is called for such a block,
check_eb_alignment() detects the nodesize misalignment, but only emits
a one-time btrfs_warn() and returns false without failing the
allocation. This allows the extent buffer to be created with a
misaligned start.

Later, during transaction commit triggered by balance, write_one_eb()
submits the dirty extent buffer for writeback, and
btree_csum_one_bio() is called to checksum it before I/O submission.
That path calls btrfs_header_bytenr(eb), which expands via
BTRFS_SETGET_HEADER_FUNCS to:

  folio_address(eb->folios[0]) + offset_in_page(eb->start)

With a nodesize-misaligned start, eb->folios[0] does not correspond to
a valid direct-mapped kernel address. folio_address() returns the
garbage value 0x0005088000000260, and dereferencing +0x30 (the bytenr
field offset in struct btrfs_header) triggers the GPF.
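
The addresses in the oops are consistent with this explanation, which can be checked with a short user-space sketch. Everything named here is illustrative: struct header_prefix mirrors only the first three fields of struct btrfs_header, and kasan_shadow_addr() and is_canonical_48() are hypothetical helpers. The sketch assumes the x86-64 generic KASAN mapping (shadow offset 0xdffffc0000000000, scale shift 3) and 4-level paging.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Leading fields of the on-disk tree block header only (illustrative).
 * csum is BTRFS_CSUM_SIZE (32) bytes and fsid is 16 bytes, so bytenr
 * lands at offset 0x30 without any padding.
 */
struct header_prefix {
	uint8_t csum[32];
	uint8_t fsid[16];
	uint64_t bytenr;	/* little-endian on disk */
};

/* x86-64 generic KASAN: shadow = (addr >> 3) + 0xdffffc0000000000 */
#define KASAN_SHADOW_SCALE_SHIFT	3
#define KASAN_SHADOW_OFFSET		0xdffffc0000000000ULL

static uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/*
 * Assumes 4-level paging: an address is canonical when bits 63..47
 * are all zero or all one.
 */
static int is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}
```

Adding offsetof(struct header_prefix, bytenr) to 0x0005088000000260 gives 0x0005088000000290, the start of the 8-byte range KASAN reports. Passing that address to kasan_shadow_addr() gives 0xe0009d1000000052, the non-canonical address in the GPF line, so the fault is most likely taken on the KASAN shadow lookup for the le64 read.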

[FIX]
Add a WARN_ON_ONCE() nodesize alignment check at the beginning of
btree_csum_one_bio() and return -EIO for misaligned tree blocks.

btree_csum_one_bio() already guards against corrupted extent buffer
state on the checksum path, and it also revalidates metadata on the
write path. The alignment check follows that pattern and must happen
before the first access to eb->folios[] via btrfs_header_bytenr().

Fixes: 6d3a61945b00 ("btrfs: warn on tree blocks which are not nodesize aligned")
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
---
An alternative fix of promoting check_eb_alignment() from warn to error
would prevent the misaligned eb from being created at all, but would
break mount and repair workflows: users need to be able to read and
inspect a filesystem containing legacy misaligned tree blocks in order
to run "btrfs balance -m" and correct the alignment.
---
 fs/btrfs/disk-io.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0aa7e5d1b05f..5ca9e63e51d6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -260,6 +260,11 @@ int btree_csum_one_bio(struct btrfs_bio *bbio)
 {
 	struct extent_buffer *eb = bbio->private;
 	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	/* A nodesize-misaligned eb has corrupted folio mapping. */
+	if (WARN_ON_ONCE(!IS_ALIGNED(eb->start, fs_info->nodesize)))
+		return -EIO;
+
 	u64 found_start = btrfs_header_bytenr(eb);
 	u64 last_trans;
 	u8 result[BTRFS_CSUM_SIZE];
-- 
2.43.0

Re: [PATCH] btrfs: disk-io: reject misaligned tree blocks in btree_csum_one_bio

Posted by David Sterba 1 day, 23 hours ago

On Wed, Mar 25, 2026 at 06:04:11PM +0800, ZhengYuan Huang wrote:
> [BUG]
> Running btrfs balance on a corrupt image can trigger a GPF, with KASAN
> reporting a wild memory access:
> 
>   BTRFS warning: tree block not nodesize aligned, start 6179131392 nodesize 16384, can be resolved by a full metadata balance
>   Oops: general protection fault, probably for non-canonical address 0xe0009d1000000052: 0000 [#1] SMP KASAN NOPTI
>   KASAN: maybe wild-memory-access in range [0x0005088000000290-0x0005088000000297]
>   Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
>   RIP: 0010:get_unaligned_le64 include/linux/unaligned.h:28 [inline]
>   RIP: 0010:btrfs_header_bytenr fs/btrfs/accessors.h:647 [inline]
>   RIP: 0010:btree_csum_one_bio+0x175/0xfe0 fs/btrfs/disk-io.c:263
>   Call Trace:
> 	<TASK>
> 	btrfs_bio_csum fs/btrfs/bio.c:511 [inline]
> 	btrfs_submit_chunk+0x138d/0x1750 fs/btrfs/bio.c:744
> 	btrfs_submit_bbio+0x20/0x40 fs/btrfs/bio.c:814
> 	write_one_eb+0x9ea/0xd30 fs/btrfs/extent_io.c:2239
> 	btree_write_cache_pages+0x836/0xdc0 fs/btrfs/extent_io.c:2342
> 	btree_writepages+0x163/0x1c0 fs/btrfs/disk-io.c:512
> 	do_writepages+0x255/0x5c0 mm/page-writeback.c:2604
> 	filemap_fdatawrite_wbc mm/filemap.c:389 [inline]
> 	filemap_fdatawrite_wbc+0xf2/0x150 mm/filemap.c:379
> 	__filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:422
> 	filemap_fdatawrite_range+0x2f/0x50 mm/filemap.c:440
> 	btrfs_write_marked_extents+0x13c/0x2d0 fs/btrfs/transaction.c:1157
> 	btrfs_write_and_wait_transaction+0xe5/0x250 fs/btrfs/transaction.c:1264
> 	btrfs_commit_transaction+0x28af/0x3d90 fs/btrfs/transaction.c:2533
> 	insert_balance_item.isra.0+0x392/0x3f0 fs/btrfs/volumes.c:3712
> 	btrfs_balance+0x1021/0x42b0 fs/btrfs/volumes.c:4582
> 	btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
> 	btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
> 	...
> 
> [CAUSE]
> The corrupt image contains a tree block whose start address (6179131392)
> is page-aligned (4 KiB) but NOT nodesize-aligned (16 KiB):
> 
>   6179131392 % 16384 == 4096

While you say it's a corrupted image it feels like it was crafted to
have such offset. The warning is from 6d3a61945b0088 ("btrfs: warn on
tree blocks which are not nodesize aligned") and it tries to catch
problems of misaligned ebs.

As we'll eventually be moving to large folios, such misaligned blocks
will become a hard problem. That should settle whether this is a
warning or an error.

Since the commit and the error message suggest running balance to fix
the alignment problem, a crash that happens inside balance should be
fixed somehow. On the other hand, the misalignment should not happen
at all.

As we try to be cautious about recognizing old filesystems with
potential problems we also have to stop at some point if it blocks a new
feature. The grace period is IMO long enough.

If you have reproduced the problem by normal operations then we should
look for a solution to prevent it. If it's from a crafted image, i.e.
one that takes a valid image and shifts a block so that it becomes
misaligned but is otherwise valid, then I suggest turning the warning
into an error and rejecting the filesystem as early as possible.

> When alloc_extent_buffer() is called for such a block,
> check_eb_alignment() detects the nodesize misalignment, but only emits
> a one-time btrfs_warn() and returns false without failing the
> allocation. This allows the extent buffer to be created with a
> misaligned start.
> 
> Later, during transaction commit triggered by balance, write_one_eb()
> submits the dirty extent buffer for writeback, and
> btree_csum_one_bio() is called to checksum it before I/O submission.
> That path calls btrfs_header_bytenr(eb), which expands via
> BTRFS_SETGET_HEADER_FUNCS to:
> 
>   folio_address(eb->folios[0]) + offset_in_page(eb->start)
> 
> With a nodesize-misaligned start, eb->folios[0] does not correspond to
> a valid direct-mapped kernel address. folio_address() returns the
> garbage value 0x0005088000000260, and dereferencing +0x30 (the bytenr
> field offset in struct btrfs_header) triggers the GPF.
> 
> [FIX]
> Add a WARN_ON_ONCE() nodesize alignment check at the beginning of
> btree_csum_one_bio() and return -EIO for misaligned tree blocks.
> 
> btree_csum_one_bio() already guards against corrupted extent buffer
> state on the checksum path, and it also revalidates metadata on the
> write path. The alignment check follows that pattern and must happen
> before the first access to eb->folios[] via btrfs_header_bytenr().
> 
> Fixes: 6d3a61945b00 ("btrfs: warn on tree blocks which are not nodesize aligned")
> Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
> ---
> An alternative fix of promoting check_eb_alignment() from warn to error
> would prevent the misaligned eb from being created at all, but would
> break mount and repair workflows: users need to be able to read and
> inspect a filesystem containing legacy misaligned tree blocks in order
> to run "btrfs balance -m" and correct the alignment.

While I agree with that I think we should start rejecting such
filesystems because of the large folio support and because we hopefully
have spent the grace period without new reports and incidents.

If you have a crafted image, and possibly a minimal one, I can add it to
the btrfs-progs fuzzed images so it can be verified as part of the test
suite.

Re: [PATCH] btrfs: disk-io: reject misaligned tree blocks in btree_csum_one_bio

Posted by Qu Wenruo 1 day, 2 hours ago


On 2026/4/1 10:35, David Sterba wrote:
> On Wed, Mar 25, 2026 at 06:04:11PM +0800, ZhengYuan Huang wrote:
>> [BUG]
>> Running btrfs balance on a corrupt image can trigger a GPF, with KASAN
>> reporting a wild memory access:
>>
>>    BTRFS warning: tree block not nodesize aligned, start 6179131392 nodesize 16384, can be resolved by a full metadata balance
>>    Oops: general protection fault, probably for non-canonical address 0xe0009d1000000052: 0000 [#1] SMP KASAN NOPTI
>>    KASAN: maybe wild-memory-access in range [0x0005088000000290-0x0005088000000297]
>>    Hardware name: QEMU Ubuntu 24.04 PC v2, BIOS 1.16.3-debian-1.16.3-2
>>    RIP: 0010:get_unaligned_le64 include/linux/unaligned.h:28 [inline]
>>    RIP: 0010:btrfs_header_bytenr fs/btrfs/accessors.h:647 [inline]
>>    RIP: 0010:btree_csum_one_bio+0x175/0xfe0 fs/btrfs/disk-io.c:263
>>    Call Trace:
>> 	<TASK>
>> 	btrfs_bio_csum fs/btrfs/bio.c:511 [inline]
>> 	btrfs_submit_chunk+0x138d/0x1750 fs/btrfs/bio.c:744
>> 	btrfs_submit_bbio+0x20/0x40 fs/btrfs/bio.c:814
>> 	write_one_eb+0x9ea/0xd30 fs/btrfs/extent_io.c:2239
>> 	btree_write_cache_pages+0x836/0xdc0 fs/btrfs/extent_io.c:2342
>> 	btree_writepages+0x163/0x1c0 fs/btrfs/disk-io.c:512
>> 	do_writepages+0x255/0x5c0 mm/page-writeback.c:2604
>> 	filemap_fdatawrite_wbc mm/filemap.c:389 [inline]
>> 	filemap_fdatawrite_wbc+0xf2/0x150 mm/filemap.c:379
>> 	__filemap_fdatawrite_range+0xd2/0x120 mm/filemap.c:422
>> 	filemap_fdatawrite_range+0x2f/0x50 mm/filemap.c:440
>> 	btrfs_write_marked_extents+0x13c/0x2d0 fs/btrfs/transaction.c:1157
>> 	btrfs_write_and_wait_transaction+0xe5/0x250 fs/btrfs/transaction.c:1264
>> 	btrfs_commit_transaction+0x28af/0x3d90 fs/btrfs/transaction.c:2533
>> 	insert_balance_item.isra.0+0x392/0x3f0 fs/btrfs/volumes.c:3712
>> 	btrfs_balance+0x1021/0x42b0 fs/btrfs/volumes.c:4582
>> 	btrfs_ioctl_balance fs/btrfs/ioctl.c:3577 [inline]
>> 	btrfs_ioctl+0x25cf/0x5b90 fs/btrfs/ioctl.c:5313
>> 	...
>>
>> [CAUSE]
>> The corrupt image contains a tree block whose start address (6179131392)
>> is page-aligned (4 KiB) but NOT nodesize-aligned (16 KiB):
>>
>>    6179131392 % 16384 == 4096
> 
> While you say it's a corrupted image it feels like it was crafted to
> have such offset. The warning is from 6d3a61945b0088 ("btrfs: warn on
> tree blocks which are not nodesize aligned") and it tries to catch
> problems of misaligned ebs.
> 
> As we'll eventually be moving to large folios, such misaligned blocks
> will become a hard problem. That should settle whether this is a
> warning or an error.
> 
> Since the commit and the error message suggest running balance to fix
> the alignment problem, a crash that happens inside balance should be
> fixed somehow. On the other hand, the misalignment should not happen
> at all.
> 
> As we try to be cautious about recognizing old filesystems with
> potential problems we also have to stop at some point if it blocks a new
> feature. The grace period is IMO long enough.
> 
> If you have reproduced the problem by normal operations then we should
> look for a solution to prevent it. If it's from a crafted image, i.e.
> one that takes a valid image and shifts a block so that it becomes
> misaligned but is otherwise valid, then I suggest turning the warning
> into an error and rejecting the filesystem as early as possible.
> 
>> When alloc_extent_buffer() is called for such a block,
>> check_eb_alignment() detects the nodesize misalignment, but only emits
>> a one-time btrfs_warn() and returns false without failing the
>> allocation. This allows the extent buffer to be created with a
>> misaligned start.
>>
>> Later, during transaction commit triggered by balance, write_one_eb()
>> submits the dirty extent buffer for writeback, and
>> btree_csum_one_bio() is called to checksum it before I/O submission.
>> That path calls btrfs_header_bytenr(eb), which expands via
>> BTRFS_SETGET_HEADER_FUNCS to:
>>
>>    folio_address(eb->folios[0]) + offset_in_page(eb->start)
>>
>> With a nodesize-misaligned start, eb->folios[0] does not correspond to
>> a valid direct-mapped kernel address. folio_address() returns the
>> garbage value 0x0005088000000260, and dereferencing +0x30 (the bytenr
>> field offset in struct btrfs_header) triggers the GPF.
>>
>> [FIX]
>> Add a WARN_ON_ONCE() nodesize alignment check at the beginning of
>> btree_csum_one_bio() and return -EIO for misaligned tree blocks.
>>
>> btree_csum_one_bio() already guards against corrupted extent buffer
>> state on the checksum path, and it also revalidates metadata on the
>> write path. The alignment check follows that pattern and must happen
>> before the first access to eb->folios[] via btrfs_header_bytenr().
>>
>> Fixes: 6d3a61945b00 ("btrfs: warn on tree blocks which are not nodesize aligned")
>> Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
>> ---
>> An alternative fix of promoting check_eb_alignment() from warn to error
>> would prevent the misaligned eb from being created at all, but would
>> break mount and repair workflows: users need to be able to read and
>> inspect a filesystem containing legacy misaligned tree blocks in order
>> to run "btrfs balance -m" and correct the alignment.
> 
> While I agree with that I think we should start rejecting such
> filesystems because of the large folio support and because we hopefully
> have spent the grace period without new reports and incidents.

I agree with the idea to reject such tree blocks, but I'm also concerned 
about btrfs-convert.

The original cause of such unaligned tree blocks is btrfs-convert,
which can create an unaligned chunk bytenr, causing all tree blocks
inside that chunk to be unaligned.
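
To illustrate why one unaligned chunk start affects every tree block in the chunk, here is a minimal user-space sketch. It assumes tree blocks sit at multiples of nodesize from the chunk start, which is a simplification. block_start() and any_block_aligned() are hypothetical helpers, and the chunk start values used below are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: logical start of the index-th nodesize slot in a chunk. */
static uint64_t block_start(uint64_t chunk_start, uint64_t nodesize,
			    uint64_t index)
{
	return chunk_start + index * nodesize;
}

/* True if at least one of the first count slots starts nodesize aligned. */
static bool any_block_aligned(uint64_t chunk_start, uint64_t nodesize,
			      uint64_t count)
{
	for (uint64_t i = 0; i < count; i++) {
		if (block_start(chunk_start, nodesize, i) % nodesize == 0)
			return true;
	}
	return false;
}
```

With a made-up chunk start of 1052672 (1 MiB + 4 KiB) and a 16 KiB nodesize, any_block_aligned() is false for any count, and every slot keeps the same 4096-byte remainder. With a chunk start of 1048576 the first slot is already aligned.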

If we want to reject them, I'd prefer to start by warning about an
unaligned chunk start first, as btrfs check already emits such a warning.

Only after we have gone a while without receiving any reports can we
change the warning into a rejection.

Thanks,
Qu

> 
> If you have a crafted image, and possibly a minimal one, I can add it to
> the btrfs-progs fuzzed images so it can be verified as part of the test
> suite.
>