fs/ext4/xattr.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
syzbot has found an out-of-bounds issue in ext4_xattr_set_entry:
==================================================================
BUG: KASAN: out-of-bounds in ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781
Read of size 18446744073709551572 at addr ffff888036426850 by task syz-executor264/5095
CPU: 0 UID: 0 PID: 5095 Comm: syz-executor264 Not tainted 6.11.0-syzkaller-03917-ga940d9a43e62 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
__asan_memmove+0x29/0x70 mm/kasan/shadow.c:94
ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781
[...]
==================================================================
This issue is caused by a negative size in memmove.
We need to check for this.
Fixes: dec214d00e0d ("ext4: xattr inode deduplication")
Reported-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=f792df426ff0f5ceb8d1
Tested-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com
Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com>
---
fs/ext4/xattr.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 46ce2f21fef9..336badb46246 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1776,7 +1776,14 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i,
} else if (s->not_found) {
/* Insert new name. */
size_t size = EXT4_XATTR_LEN(name_len);
- size_t rest = (void *)last - (void *)here + sizeof(__u32);
+ size_t rest;
+
+ if (last < here) {
+ ret = -ENOSPC;
+ goto out;
+ } else {
+ rest = (void *)last - (void *)here + sizeof(__u32);
+ }
memmove((void *)here + size, here, rest);
memset(here, 0, size);
--
2.34.1
--
Best,
Qianqiang Liu
On 2024/9/22 14:42, Qianqiang Liu wrote: > syzbot has found an out-of-bounds issue in ext4_xattr_set_entry: > > ================================================================== > BUG: KASAN: out-of-bounds in ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 > Read of size 18446744073709551572 at addr ffff888036426850 by task syz-executor264/5095 > > CPU: 0 UID: 0 PID: 5095 Comm: syz-executor264 Not tainted 6.11.0-syzkaller-03917-ga940d9a43e62 #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:93 [inline] > dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 > print_address_description mm/kasan/report.c:377 [inline] > print_report+0x169/0x550 mm/kasan/report.c:488 > kasan_report+0x143/0x180 mm/kasan/report.c:601 > kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 > __asan_memmove+0x29/0x70 mm/kasan/shadow.c:94 > ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 > [...] > ================================================================== > > This issue is caused by a negative size in memmove. > We need to check for this. > > Fixes: dec214d00e0d ("ext4: xattr inode deduplication") > Reported-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com > Closes: https://syzkaller.appspot.com/bug?extid=f792df426ff0f5ceb8d1 > Tested-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com > Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com> > --- > fs/ext4/xattr.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c > index 46ce2f21fef9..336badb46246 100644 > --- a/fs/ext4/xattr.c > +++ b/fs/ext4/xattr.c > @@ -1776,7 +1776,14 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, > } else if (s->not_found) { > /* Insert new name. */ > size_t size = EXT4_XATTR_LEN(name_len); > - size_t rest = (void *)last - (void *)here + sizeof(__u32); > + size_t rest; > + > + if (last < here) { > + ret = -ENOSPC; > + goto out; > + } else { > + rest = (void *)last - (void *)here + sizeof(__u32); > + } > > memmove((void *)here + size, here, rest); > memset(here, 0, size); This change just passes syzbot's test cases without fixing the real problem. The root cause of the problem is that the inode's xattr block is marked as free in the block bitmap, so that block is allocated to the ea inode resulting in the data in the xattr block being overwritten, and the last of the second lookups changing resulting in out-of-bounds access. The stack that triggers the problem is as follows: // An inode with an xattr block of 33. __ext4_mark_inode_dirty __ext4_expand_extra_isize ext4_expand_extra_isize_ea ext4_xattr_make_inode_space // Move xattr from inode to block ext4_xattr_move_to_block // Find out if the xattr exists in the block ext4_xattr_block_find // If xattr does not exist, here == last xattr_find_entry // Add a new xattr to the block ext4_xattr_block_set |// xattr is too long, needs an ea inode |ext4_xattr_inode_lookup_create | ext4_xattr_inode_create | ext4_xattr_inode_write | ext4_map_blocks | // xattr block 33 is assigned to the new ea inode | memcpy(bh->b_data, buf, csize) | // The value of xattr overwrites the data in the xattr block. |ext4_xattr_set_entry // Since the contents of the xattr block have changed, // now here == last does not hold, so it is possible to // have last < here and trigger an out-of-bounds access. So I think we should probably add a helper function ext4_mb_block_inuse() that checks if xattr block is free with the block bitmap in check_xattrs(). Or go one step further and add a mechanism like xfs Reverse-Mapping, which makes sure that allocated blocks do point to the target inode, which could replace the current block_validity, and could also be used in future online fscks. Ted, Honza, do you have any thoughts? Regards, Baokun
On Tue 08-10-24 15:40:39, Baokun Li wrote: > On 2024/9/22 14:42, Qianqiang Liu wrote: > > syzbot has found an out-of-bounds issue in ext4_xattr_set_entry: > > > > ================================================================== > > BUG: KASAN: out-of-bounds in ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 > > Read of size 18446744073709551572 at addr ffff888036426850 by task syz-executor264/5095 > > > > CPU: 0 UID: 0 PID: 5095 Comm: syz-executor264 Not tainted 6.11.0-syzkaller-03917-ga940d9a43e62 #0 > > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > > Call Trace: > > <TASK> > > __dump_stack lib/dump_stack.c:93 [inline] > > dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 > > print_address_description mm/kasan/report.c:377 [inline] > > print_report+0x169/0x550 mm/kasan/report.c:488 > > kasan_report+0x143/0x180 mm/kasan/report.c:601 > > kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 > > __asan_memmove+0x29/0x70 mm/kasan/shadow.c:94 > > ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 > > [...] > > ================================================================== > > > > This issue is caused by a negative size in memmove. > > We need to check for this. > > > > Fixes: dec214d00e0d ("ext4: xattr inode deduplication") > > Reported-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com > > Closes: https://syzkaller.appspot.com/bug?extid=f792df426ff0f5ceb8d1 > > Tested-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com > > Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com> > > --- > > fs/ext4/xattr.c | 9 ++++++++- > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c > > index 46ce2f21fef9..336badb46246 100644 > > --- a/fs/ext4/xattr.c > > +++ b/fs/ext4/xattr.c > > @@ -1776,7 +1776,14 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, > > } else if (s->not_found) { > > /* Insert new name. */ > > size_t size = EXT4_XATTR_LEN(name_len); > > - size_t rest = (void *)last - (void *)here + sizeof(__u32); > > + size_t rest; > > + > > + if (last < here) { > > + ret = -ENOSPC; > > + goto out; > > + } else { > > + rest = (void *)last - (void *)here + sizeof(__u32); > > + } > > memmove((void *)here + size, here, rest); > > memset(here, 0, size); > This change just passes syzbot's test cases without fixing the real > problem. > > The root cause of the problem is that the inode's xattr block is marked as > free in the block bitmap, so that block is allocated to the ea inode > resulting in the data in the xattr block being overwritten, and the last of > the second lookups changing resulting in out-of-bounds access. > > The stack that triggers the problem is as follows: > > // An inode with an xattr block of 33. > __ext4_mark_inode_dirty > __ext4_expand_extra_isize > ext4_expand_extra_isize_ea > ext4_xattr_make_inode_space > // Move xattr from inode to block > ext4_xattr_move_to_block > // Find out if the xattr exists in the block > ext4_xattr_block_find > // If xattr does not exist, here == last > xattr_find_entry > // Add a new xattr to the block > ext4_xattr_block_set > |// xattr is too long, needs an ea inode > |ext4_xattr_inode_lookup_create > | ext4_xattr_inode_create > | ext4_xattr_inode_write > | ext4_map_blocks > | // xattr block 33 is assigned to the new ea inode > | memcpy(bh->b_data, buf, csize) > | // The value of xattr overwrites the data in the xattr block. > |ext4_xattr_set_entry > // Since the contents of the xattr block have changed, > // now here == last does not hold, so it is possible to > // have last < here and trigger an out-of-bounds access. > > So I think we should probably add a helper function ext4_mb_block_inuse() > that checks if xattr block is free with the block bitmap in check_xattrs(). Well, even that would be a relatively narrow fix. You could have e.g. file reference the xattr block as one of its data blocks and then corrupt xattr contents at unfortunate moment. That will not get fixed by checking whether the block is allocated. These multiply claimed blocks (as e2fsck calls it) are very hard to detect inside the kernel. > Or go one step further and add a mechanism like xfs Reverse-Mapping, which > makes sure that allocated blocks do point to the target inode, which could > replace the current block_validity, and could also be used in future online > fscks. Well, that is a rather big change. It requires significant on-disk format change and also performance cost when to maintain. Furthermore for xattr blocks which can be shared by many inodes it is not even clear how to implement this... So I'm not sure we really want to do this either. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On 2024/10/9 23:50, Jan Kara wrote: > On Tue 08-10-24 15:40:39, Baokun Li wrote: >> On 2024/9/22 14:42, Qianqiang Liu wrote: >>> syzbot has found an out-of-bounds issue in ext4_xattr_set_entry: >>> >>> ================================================================== >>> BUG: KASAN: out-of-bounds in ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 >>> Read of size 18446744073709551572 at addr ffff888036426850 by task syz-executor264/5095 >>> >>> CPU: 0 UID: 0 PID: 5095 Comm: syz-executor264 Not tainted 6.11.0-syzkaller-03917-ga940d9a43e62 #0 >>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 >>> Call Trace: >>> <TASK> >>> __dump_stack lib/dump_stack.c:93 [inline] >>> dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 >>> print_address_description mm/kasan/report.c:377 [inline] >>> print_report+0x169/0x550 mm/kasan/report.c:488 >>> kasan_report+0x143/0x180 mm/kasan/report.c:601 >>> kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 >>> __asan_memmove+0x29/0x70 mm/kasan/shadow.c:94 >>> ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 >>> [...] >>> ================================================================== >>> >>> This issue is caused by a negative size in memmove. >>> We need to check for this. >>> >>> Fixes: dec214d00e0d ("ext4: xattr inode deduplication") >>> Reported-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com >>> Closes: https://syzkaller.appspot.com/bug?extid=f792df426ff0f5ceb8d1 >>> Tested-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com >>> Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com> >>> --- >>> fs/ext4/xattr.c | 9 ++++++++- >>> 1 file changed, 8 insertions(+), 1 deletion(-) >>> >>> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c >>> index 46ce2f21fef9..336badb46246 100644 >>> --- a/fs/ext4/xattr.c >>> +++ b/fs/ext4/xattr.c >>> @@ -1776,7 +1776,14 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, >>> } else if (s->not_found) { >>> /* Insert new name. */ >>> size_t size = EXT4_XATTR_LEN(name_len); >>> - size_t rest = (void *)last - (void *)here + sizeof(__u32); >>> + size_t rest; >>> + >>> + if (last < here) { >>> + ret = -ENOSPC; >>> + goto out; >>> + } else { >>> + rest = (void *)last - (void *)here + sizeof(__u32); >>> + } >>> memmove((void *)here + size, here, rest); >>> memset(here, 0, size); >> This change just passes syzbot's test cases without fixing the real >> problem. >> >> The root cause of the problem is that the inode's xattr block is marked as >> free in the block bitmap, so that block is allocated to the ea inode >> resulting in the data in the xattr block being overwritten, and the last of >> the second lookups changing resulting in out-of-bounds access. >> >> The stack that triggers the problem is as follows: >> >> // An inode with an xattr block of 33. >> __ext4_mark_inode_dirty >> __ext4_expand_extra_isize >> ext4_expand_extra_isize_ea >> ext4_xattr_make_inode_space >> // Move xattr from inode to block >> ext4_xattr_move_to_block >> // Find out if the xattr exists in the block >> ext4_xattr_block_find >> // If xattr does not exist, here == last >> xattr_find_entry >> // Add a new xattr to the block >> ext4_xattr_block_set >> |// xattr is too long, needs an ea inode >> |ext4_xattr_inode_lookup_create >> | ext4_xattr_inode_create >> | ext4_xattr_inode_write >> | ext4_map_blocks >> | // xattr block 33 is assigned to the new ea inode >> | memcpy(bh->b_data, buf, csize) >> | // The value of xattr overwrites the data in the xattr block. >> |ext4_xattr_set_entry >> // Since the contents of the xattr block have changed, >> // now here == last does not hold, so it is possible to >> // have last < here and trigger an out-of-bounds access. >> >> So I think we should probably add a helper function ext4_mb_block_inuse() >> that checks if xattr block is free with the block bitmap in check_xattrs(). Hi Honza, Thanks so much for your thoughts and feedback! > Well, even that would be a relatively narrow fix. You could have e.g. > file reference the xattr block as one of its data blocks and then corrupt > xattr contents at unfortunate moment. That will not get fixed by checking > whether the block is allocated. These multiply claimed blocks (as e2fsck > calls it) are very hard to detect inside the kernel. Yes, after locating the issue, the first thought was to just get the buffer lock and check xattr magic and xattr block checksum. However, if the block is allocated as an xattr block to another file, the issue may still occur. Therefore we have to make sure that the block has been allocated to the current file. With the block bitmap we can verify that the current block is allocated, but as you pointed out we cannot verify that it is only allocated to the current file. That means we need some means to find the owner of the block by block, and then I came up with xfs Reverse-Mapping. >> Or go one step further and add a mechanism like xfs Reverse-Mapping, which >> makes sure that allocated blocks do point to the target inode, which could >> replace the current block_validity, and could also be used in future online >> fscks. > Well, that is a rather big change. It requires significant on-disk format > change and also performance cost when to maintain. Furthermore for xattr > blocks which can be shared by many inodes it is not even clear how to > implement this... So I'm not sure we really want to do this either. > > Honza Yes, there can be a lot of work involved. * Perhaps we could create an rmap file to store the rmap tree to avoid on-disk format changes. * The performance impact of maintaining rmap really needs to be evaluated, perhaps by writing a simple DEMO to test it. * XFS supports shared blocks(A.K.A. reflink.), so even if the physical blocks are the same, but the inodes are different or the logical blocks are different, they will be recorded multiple times in the tree. So the shared xattr block can be handled similarly. We have plans to support online fsck in the future, and implementing rmap is one of the steps. Perhaps one can wait until rmap is implemented to assess whether it is worth a strict check here. Implementing rmap may take some time, until then we can avoid the problem as much as possible by checking the magic and xattr block csum. Maybe something like this? diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index 7647e9f6e190..cd3ae1e3371c 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -1676,6 +1676,13 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, } } + if (WARN_ON_ONCE(last < here)) { + EXT4_ERROR_INODE(inode, "corrupted xattr entries in %s", + is_block ? "block" : "ibody"); + ret = -EFSCORRUPTED; + goto out; + } + /* Check whether we have enough space. */ if (i->value) { size_t free; @@ -1923,6 +1930,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode, } if (s->base) { + struct ext4_xattr_header *hdr; int offset = (char *)s->here - bs->bh->b_data; BUFFER_TRACE(bs->bh, "get_write_access"); @@ -1932,6 +1940,16 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode, goto cleanup; lock_buffer(bs->bh); + hdr = header(s->base); + + if (hdr->h_magic != cpu_to_le32(EXT4_XATTR_MAGIC) || + (ext4_has_metadata_csum(inode->i_sb) && + (ext4_xattr_block_csum(inode, bs->bh->b_blocknr, hdr) != + hdr->h_checksum))) { + unlock_buffer(bs->bh); + error = -EFSCORRUPTED; + goto bad_block; + } if (header(s->base)->h_refcount == cpu_to_le32(1)) { __u32 hash = le32_to_cpu(BHDR(bs->bh)->h_hash); -- Thanks, Baokun
On Fri 11-10-24 10:18:04, Baokun Li wrote: > On 2024/10/9 23:50, Jan Kara wrote: > > > Or go one step further and add a mechanism like xfs Reverse-Mapping, which > > > makes sure that allocated blocks do point to the target inode, which could > > > replace the current block_validity, and could also be used in future online > > > fscks. > > Well, that is a rather big change. It requires significant on-disk format > > change and also performance cost when to maintain. Furthermore for xattr > > blocks which can be shared by many inodes it is not even clear how to > > implement this... So I'm not sure we really want to do this either. > > Yes, there can be a lot of work involved. > > * Perhaps we could create an rmap file to store the rmap tree to avoid > on-disk format changes. > * The performance impact of maintaining rmap really needs to be evaluated, > perhaps by writing a simple DEMO to test it. > * XFS supports shared blocks(A.K.A. reflink.), so even if the physical > blocks are the same, but the inodes are different or the logical blocks > are different, they will be recorded multiple times in the tree. So the > shared xattr block can be handled similarly. > > We have plans to support online fsck in the future, and implementing rmap > is one of the steps. Perhaps one can wait until rmap is implemented to > assess whether it is worth a strict check here. Yes, we could implement something like this be as you wrote, it's going to be a lot of work. We've briefly discussed this with Ted on ext4 call and we came to a conclusion that this is a type of corruption ext4 may never protect agaist. You simply should not mount arbitrarily corrupted filesystems... But if you want to try, sure go ahead :) One relatively easy solution to similar class of problems would be to store the type of metadata buffer inside the buffer_head when we are verifying checksum, clear the info when freeing the block in __ext4_forget(), and fail with EFSCORRUPTED error when one type -> another type transition would happen. > Implementing rmap may take some time, until then we can avoid the problem > as much as possible by checking the magic and xattr block csum. > Maybe something like this? > > diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c > index 7647e9f6e190..cd3ae1e3371c 100644 > --- a/fs/ext4/xattr.c > +++ b/fs/ext4/xattr.c > @@ -1676,6 +1676,13 @@ static int ext4_xattr_set_entry(struct > ext4_xattr_info *i, > } > } > > + if (WARN_ON_ONCE(last < here)) { > + EXT4_ERROR_INODE(inode, "corrupted xattr entries in %s", > + is_block ? "block" : "ibody"); > + ret = -EFSCORRUPTED; > + goto out; > + } > + > /* Check whether we have enough space. */ > if (i->value) { > size_t free; > @@ -1923,6 +1930,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode > *inode, > } > > if (s->base) { > + struct ext4_xattr_header *hdr; > int offset = (char *)s->here - bs->bh->b_data; > > BUFFER_TRACE(bs->bh, "get_write_access"); > @@ -1932,6 +1940,16 @@ ext4_xattr_block_set(handle_t *handle, struct inode > *inode, > goto cleanup; > > lock_buffer(bs->bh); > + hdr = header(s->base); > + > + if (hdr->h_magic != cpu_to_le32(EXT4_XATTR_MAGIC) || > + (ext4_has_metadata_csum(inode->i_sb) && > + (ext4_xattr_block_csum(inode, bs->bh->b_blocknr, hdr) > != > + hdr->h_checksum))) { > + unlock_buffer(bs->bh); > + error = -EFSCORRUPTED; > + goto bad_block; > + } > > if (header(s->base)->h_refcount == cpu_to_le32(1)) { > __u32 hash = le32_to_cpu(BHDR(bs->bh)->h_hash); Hum, there are more places in xattr code that access a buffer that could have been modified. So why do you add check into this place? Is it somehow special? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On 2024/10/15 0:31, Jan Kara wrote: > On Fri 11-10-24 10:18:04, Baokun Li wrote: >> On 2024/10/9 23:50, Jan Kara wrote: >>>> Or go one step further and add a mechanism like xfs Reverse-Mapping, which >>>> makes sure that allocated blocks do point to the target inode, which could >>>> replace the current block_validity, and could also be used in future online >>>> fscks. >>> Well, that is a rather big change. It requires significant on-disk format >>> change and also performance cost when to maintain. Furthermore for xattr >>> blocks which can be shared by many inodes it is not even clear how to >>> implement this... So I'm not sure we really want to do this either. >> Yes, there can be a lot of work involved. >> >> * Perhaps we could create an rmap file to store the rmap tree to avoid >> on-disk format changes. >> * The performance impact of maintaining rmap really needs to be evaluated, >> perhaps by writing a simple DEMO to test it. >> * XFS supports shared blocks(A.K.A. reflink.), so even if the physical >> blocks are the same, but the inodes are different or the logical blocks >> are different, they will be recorded multiple times in the tree. So the >> shared xattr block can be handled similarly. >> >> We have plans to support online fsck in the future, and implementing rmap >> is one of the steps. Perhaps one can wait until rmap is implemented to >> assess whether it is worth a strict check here. > Yes, we could implement something like this be as you wrote, it's going to > be a lot of work. We've briefly discussed this with Ted on ext4 call and we > came to a conclusion that this is a type of corruption ext4 may never > protect agaist. You simply should not mount arbitrarily corrupted > filesystems... Thank you for discussing this with Ted. This kind of problem is really tricky. > But if you want to try, sure go ahead :) As server clusters get larger and larger, server maintenance becomes very difficult. Therefore, timely detection of problems (i.e. online scanning, similar to e2fsck -fn) and timely and non-stop fixing of problems (i.e. online fsck, similar to e2fsck -a) have always been the requirements of our customers. Thus online fsck has been on our TODO list, and it's really time to start doing it. 😀 > One relatively easy solution to similar class of problems would be to store > the type of metadata buffer inside the buffer_head when we are verifying > checksum, clear the info when freeing the block in __ext4_forget(), and > fail with EFSCORRUPTED error when one type -> another type transition would > happen. This solution looks great. If we save checksum further, we can sense not only the change of block type, but also the change of block data. But now journal_head takes up bh->b_private, so to record information, we need to add new fields to buffer_head (e.g. b_info), or modify the logic of journal_head (e.g. make bh->b_private hold ext4_bufdata, and include the journal_head in ext4_bufdata. ocfs2 needs a similar adaptation). >> Implementing rmap may take some time, until then we can avoid the problem >> as much as possible by checking the magic and xattr block csum. >> Maybe something like this? >> >> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c >> index 7647e9f6e190..cd3ae1e3371c 100644 >> --- a/fs/ext4/xattr.c >> +++ b/fs/ext4/xattr.c >> @@ -1676,6 +1676,13 @@ static int ext4_xattr_set_entry(struct >> ext4_xattr_info *i, >> } >> } >> >> + if (WARN_ON_ONCE(last < here)) { >> + EXT4_ERROR_INODE(inode, "corrupted xattr entries in %s", >> + is_block ? "block" : "ibody"); >> + ret = -EFSCORRUPTED; >> + goto out; >> + } >> + >> /* Check whether we have enough space. */ >> if (i->value) { >> size_t free; >> @@ -1923,6 +1930,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode >> *inode, >> } >> >> if (s->base) { >> + struct ext4_xattr_header *hdr; >> int offset = (char *)s->here - bs->bh->b_data; >> >> BUFFER_TRACE(bs->bh, "get_write_access"); >> @@ -1932,6 +1940,16 @@ ext4_xattr_block_set(handle_t *handle, struct inode >> *inode, >> goto cleanup; >> >> lock_buffer(bs->bh); >> + hdr = header(s->base); >> + >> + if (hdr->h_magic != cpu_to_le32(EXT4_XATTR_MAGIC) || >> + (ext4_has_metadata_csum(inode->i_sb) && >> + (ext4_xattr_block_csum(inode, bs->bh->b_blocknr, hdr) >> != >> + hdr->h_checksum))) { >> + unlock_buffer(bs->bh); >> + error = -EFSCORRUPTED; >> + goto bad_block; >> + } >> >> if (header(s->base)->h_refcount == cpu_to_le32(1)) { >> __u32 hash = le32_to_cpu(BHDR(bs->bh)->h_hash); > Hum, there are more places in xattr code that access a buffer that could > have been modified. So why do you add check into this place? Is it somehow > special? > > Honza The out-of-bounds access occurs here because the last dentry obtained in ext4_xattr_set_entry() is not the same as the last dentry obtained in ext4_xattr_block_find(). When we modify an xattr, we always hold the inode lock, so the ibody case is not considered. Check_xattrs() is called to do a full check when looking up an xattr, so it's not necessary to consider that either. When inserting an xattr into an xattr block, there are three cases: * xattr block is newly allocated, and the block is allocated only after the entry is set in memory. * xattr block is unshared, insert the new xattr directly. * xattr block is shared, copy the xattr block and insert the new xattr. Only in the last two cases can a multiply claimed xattr block result in out-of-bounds access, so the buffer lock is held to verify that the data is correct. (It looks like kmemdup should be moved to the buffer lock here as well.🤔) Thanks again! Regards, Baokun
On Wed, Oct 16, 2024 at 04:02:40PM +0800, Baokun Li wrote: > As server clusters get larger and larger, server maintenance becomes very > difficult. Therefore, timely detection of problems (i.e. online scanning, > similar to e2fsck -fn) and timely and non-stop fixing of problems (i.e. > online fsck, similar to e2fsck -a) have always been the requirements of > our customers. Thus online fsck has been on our TODO list, and it's really > time to start doing it. 😀 As far as online scaning is concerned, if you are using LVM, we can use a combination of dm-snapshot and e2fsck -fn --- that is what the e2scrub command automates. Online fsck is much harder, since it would require back pointers to do this efficienctly. To do this, a general way of solving this would involve a generalized some kind of b-tree or b+tree where changes are managed via jbd2. This could be used so that (for example) if we had a tree which maps block ranges to an inode number, then given a block number, we can figure out which inode "owns" that block. The harder part is those objects that have multiple forward pointers --- for example an inode might have multiple hard links to multiple directories, so we need to handle this somehow. If we had the jbd2-aware b+tree, we could also use this add support for reflink/clone, which would also be cool. If this is something that your team really weants to work on, what I'd suggest is to create a rough design of what the journaled b+tree would look like, and then implement it first, since this is the prerequisite for a huge number of advanced file system features. Implementation should be done in a way that makes it easy for the code to be usable both in the kernel and in e2fsprogs, since life will be much easier if we have e2fsck and debugfs support for the new file system data structures from the very beginning of the development. If your company is willing to invest in the engineering effort to do this, great! But I have to point out that an alternative approach that you should consider is whether XFS might be a closer match for some of your customers' needs. The advantage of ext4 is that it is much simpler and easier to understand that XFS. But as we add these new features, ext4 will get more complex. And so one of the design considerations we should keep in mind is to keep ext4 as simple and miantainable as possible, even as we add new functionality. Cheers, - Ted
On 2024/10/17 4:47, Theodore Ts'o wrote: > On Wed, Oct 16, 2024 at 04:02:40PM +0800, Baokun Li wrote: >> As server clusters get larger and larger, server maintenance becomes very >> difficult. Therefore, timely detection of problems (i.e. online scanning, >> similar to e2fsck -fn) and timely and non-stop fixing of problems (i.e. >> online fsck, similar to e2fsck -a) have always been the requirements of >> our customers. Thus online fsck has been on our TODO list, and it's really >> time to start doing it. 😀 > As far as online scaning is concerned, if you are using LVM, we can > use a combination of dm-snapshot and e2fsck -fn --- that is what the > e2scrub command automates. Yes, e2scrub is very nice, but it has too many limitations. We have some customers who don't use lvm. > Online fsck is much harder, since it would require back pointers to do > this efficienctly. Indeed, our rough plan is to first implement isolation of abnormal file system resources, so that the system can continue to run normally even when there is an error; then implement online scanning, so that the maintainer can see the health report at any time; and finally implement the most difficult online repair. > To do this, a general way of solving this would > involve a generalized some kind of b-tree or b+tree where changes are > managed via jbd2. This could be used so that (for example) if we had > a tree which maps block ranges to an inode number, then given a block > number, we can figure out which inode "owns" that block. The harder > part is those objects that have multiple forward pointers --- for > example an inode might have multiple hard links to multiple > directories, so we need to handle this somehow. We do need to establish the mapping of physical blocks to inodes and inodes to parent dir. By tree managed by jbd2 do you mean updating the tree when committing to journal? Or are updates to the tree logged to journal? > > If we had the jbd2-aware b+tree, we could also use this add support > for reflink/clone, which would also be cool. Yeah, reflink is pretty cool, we can try it out when the others are done. > > If this is something that your team really weants to work on, what I'd > suggest is to create a rough design of what the journaled b+tree would > look like, and then implement it first, since this is the prerequisite > for a huge number of advanced file system features. Implementation > should be done in a way that makes it easy for the code to be usable > both in the kernel and in e2fsprogs, since life will be much easier if > we have e2fsck and debugfs support for the new file system data > structures from the very beginning of the development. Thank you for your suggestion! This is really key to the development. We'll discuss the overall design internally before consulting the community. > If your company is willing to invest in the engineering effort to do > this, great! But I have to point out that an alternative approach > that you should consider is whether XFS might be a closer match for > some of your customers' needs. The advantage of ext4 is that it is > much simpler and easier to understand that XFS. The XFS maintainability enhancement is something my colleague is working on. But we have a fair number of downstream customers who prefer ext4, so it's worth investing the time to do that. > But as we add these > new features, ext4 will get more complex. And so one of the design > considerations we should keep in mind is to keep ext4 as simple and > miantainable as possible, even as we add new functionality. > > Cheers, > > - Ted Of course! we will keep the code as simple and maintainable as possible. Thanks again for your input! 😉 Cheers, Baokun
On Thu, Oct 17, 2024 at 08:42:59PM +0800, Baokun Li wrote: > Indeed, our rough plan is to first implement isolation of abnormal file > system resources, so that the system can continue to run normally even > when there is an error; then implement online scanning, so that the > maintainer can see the health report at any time; and finally implement > the most difficult online repair. We have some of this already; if a block group has obvious inconsistencies --- for example, if there is an attempt to mark a block or inode as free, but it's already marked as free as the allocation bitmap, we can mark the block group as inconsistent, and then avoid allocating from the block group. That's easy because it's the kind of inconsistency which can be detected locally. The problem comes with those inconsistencies which require a global examination of the file system data structures. For example, is the refcount of an inode correct? Or is a block claimed by more than one inode? The e2scrub approach requires creating a read-only snapshot (which is why we need LVM) and then running e2fsck in userspace, because it does a global examination of all file system data structures. > We do need to establish the mapping of physical blocks to inodes and > inodes to parent dir. By tree managed by jbd2 do you mean updating > the tree when committing to journal? Or are updates to the tree > logged to journal? When we allocate a block, we need to journal the changes to the allocation bitmap. If we are going to also update the reverse mapping data structure, that needs to be journalled also, so that after a crash, the data structures are consistent. - Ted
On 2024/10/17 22:47, Theodore Ts'o wrote: > On Thu, Oct 17, 2024 at 08:42:59PM +0800, Baokun Li wrote: >> Indeed, our rough plan is to first implement isolation of abnormal file >> system resources, so that the system can continue to run normally even >> when there is an error; then implement online scanning, so that the >> maintainer can see the health report at any time; and finally implement >> the most difficult online repair. > We have some of this already; if a block group has obvious > inconsistencies --- for example, if there is an attempt to mark a > block or inode as free, but it's already marked as free as the > allocation bitmap, we can mark the block group as inconsistent, and > then avoid allocating from the block group. That's easy because it's > the kind of inconsistency which can be detected locally. Yes, there is now block group level isolation. Our goal is to further reduce the scope of isolation to minimise the impact of isolation. The rough idea is to isolate resources where errors are reported, and throw errors when isolation is not possible. This may be a bit crude, but after implementing inline scanning we can achieve precise fine-grained isolation. > The problem comes with those inconsistencies which require a global > examination of the file system data structures. For example, is the > refcount of an inode correct? Or is a block claimed by more than one > inode? The e2scrub approach requires creating a read-only snapshot > (which is why we need LVM) and then running e2fsck in userspace, > because it does a global examination of all file system data > structures. Indeed, consistency is a tricky issue, and we'll focus on that piece of logic. >> We do need to establish the mapping of physical blocks to inodes and >> inodes to parent dir. By tree managed by jbd2 do you mean updating >> the tree when committing to journal? Or are updates to the tree >> logged to journal? > When we allocate a block, we need to journal the changes to the > allocation bitmap. If we are going to also update the reverse mapping > data structure, that needs to be journalled also, so that after a > crash, the data structures are consistent. > > - Ted > Of course, we have to make sure that the metadata modification and the tree update are in the same transaction, otherwise there is no guarantee that the metadata is consistent. Thank you for your input! Regards, Baokun
On Sun, Sep 22, 2024 at 02:42:49PM +0800, Qianqiang Liu wrote: > syzbot has found an out-of-bounds issue in ext4_xattr_set_entry: > > ================================================================== > BUG: KASAN: out-of-bounds in ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 > Read of size 18446744073709551572 at addr ffff888036426850 by task syz-executor264/5095 > > CPU: 0 UID: 0 PID: 5095 Comm: syz-executor264 Not tainted 6.11.0-syzkaller-03917-ga940d9a43e62 #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > Call Trace: > <TASK> > __dump_stack lib/dump_stack.c:93 [inline] > dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119 > print_address_description mm/kasan/report.c:377 [inline] > print_report+0x169/0x550 mm/kasan/report.c:488 > kasan_report+0x143/0x180 mm/kasan/report.c:601 > kasan_check_range+0x282/0x290 mm/kasan/generic.c:189 > __asan_memmove+0x29/0x70 mm/kasan/shadow.c:94 > ext4_xattr_set_entry+0x8ce/0x1f60 fs/ext4/xattr.c:1781 > [...] > ================================================================== > > This issue is caused by a negative size in memmove. > We need to check for this. > > Fixes: dec214d00e0d ("ext4: xattr inode deduplication") > Reported-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com > Closes: https://syzkaller.appspot.com/bug?extid=f792df426ff0f5ceb8d1 > Tested-by: syzbot+f792df426ff0f5ceb8d1@syzkaller.appspotmail.com > Signed-off-by: Qianqiang Liu <qianqiang.liu@163.com> > --- > fs/ext4/xattr.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c > index 46ce2f21fef9..336badb46246 100644 > --- a/fs/ext4/xattr.c > +++ b/fs/ext4/xattr.c > @@ -1776,7 +1776,14 @@ static int ext4_xattr_set_entry(struct ext4_xattr_info *i, > } else if (s->not_found) { > /* Insert new name. */ > size_t size = EXT4_XATTR_LEN(name_len); > - size_t rest = (void *)last - (void *)here + sizeof(__u32); > + size_t rest; > + > + if (last < here) { > + ret = -ENOSPC; > + goto out; > + } else { > + rest = (void *)last - (void *)here + sizeof(__u32); > + } Hey Qianqiang, Thanks for the patch. I'm still reviewing this codepath but I do have some questions around the patch. So I understand that xattrs are arranged in the following format: * +------------------+ * | header | * | entry 1 | * | entry 2 | * | entry 3 | * | four null bytes | <--- last * | . . . | * | . . . | <--- here * | . . . | * | value 1 | * | value 3 | * | value 2 | * +------------------+ Now, in this error, my understanding is that we are actually ending up in a case where "here" ie the place where the new xattr entry will go is beyond the "last" ie the last slot for xattr entry and that is causing an underflow, something like the above diagram. My only concern is that why were we not able to detect this in the logic near the start of the function where we explcity check if we have enough space? Perhaps we should be fixing the logic in that if {..} instead since the comment a few lines above your fix: /* No failures allowed past this point. */ does suggest that we can't error out below that point, so ideally all the checks would have been done before that. I'm still going through the issue, will update here if needed. Regards, ojaswin > > memmove((void *)here + size, here, rest); > memset(here, 0, size); > -- > 2.34.1 > > -- > Best, > Qianqiang Liu >
Hi Ojaswin, On Tue, Oct 01, 2024 at 03:11:44PM +0530, Ojaswin Mujoo wrote: > > Hey Qianqiang, > > Thanks for the patch. I'm still reviewing this codepath but I do have > some questions around the patch. So I understand that xattrs are > arranged in the following format: > > * +------------------+ > * | header | > * | entry 1 | > * | entry 2 | > * | entry 3 | > * | four null bytes | <--- last > * | . . . | > * | . . . | <--- here > * | . . . | > * | value 1 | > * | value 3 | > * | value 2 | > * +------------------+ > > Now, in this error, my understanding is that we are actually ending up > in a case where "here" ie the place where the new xattr entry will go is > beyond the "last" ie the last slot for xattr entry and that is causing > an underflow, something like the above diagram. > > My only concern is that why were we not able to detect this in the logic > near the start of the function where we explcity check if we have enough > space? > > Perhaps we should be fixing the logic in that if {..} instead > since the comment a few lines above your fix: > > /* No failures allowed past this point. */ > > does suggest that we can't error out below that point, so ideally all > the checks would have been done before that. > > I'm still going through the issue, will update here if needed. > > Regards, > ojaswin > I reviewed the codepath, and here is the backtrace when the error occurs: => vfs_unlink => ext4_unlink => __ext4_unlink => __ext4_mark_inode_dirty => ext4_try_to_expand_extra_isize => __ext4_expand_extra_isize => ext4_expand_extra_isize_ea => ext4_xattr_make_inode_space => ext4_xattr_move_to_block -> ext4_xattr_block_find -> xattr_find_entry => ext4_xattr_block_set => ext4_xattr_set_entry => memmove((void *)here + size, here, rest); The xattr_find_entry function return -ENODATA, but beacuase of the following code, the error does not be returned to caller: static int ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i, struct ext4_xattr_block_find *bs) { ... error = xattr_find_entry(inode, &bs->s.here, bs->s.end, i->name_index, i->name, 1); if (error && error != -ENODATA) return error; ... } So, perhaps we could modify the code as follows: diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c index e0e1956dcdd3..649b278d4c1f 100644 --- a/fs/ext4/xattr.c +++ b/fs/ext4/xattr.c @@ -1884,7 +1884,7 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i, bs->s.here = bs->s.first; error = xattr_find_entry(inode, &bs->s.here, bs->s.end, i->name_index, i->name, 1); - if (error && error != -ENODATA) + if (error) return error; bs->s.not_found = error; } Or, we could check if s->not_found is -ENODATA in the ext4_xattr_set_entry function. Any suggestions? -- Best, Qianqiang Liu
Hi Ojaswin, On Tue, Oct 01, 2024 at 03:11:44PM +0530, Ojaswin Mujoo wrote: > > Hey Qianqiang, > > Thanks for the patch. I'm still reviewing this codepath but I do have > some questions around the patch. So I understand that xattrs are > arranged in the following format: > > * +------------------+ > * | header | > * | entry 1 | > * | entry 2 | > * | entry 3 | > * | four null bytes | <--- last > * | . . . | > * | . . . | <--- here > * | . . . | > * | value 1 | > * | value 3 | > * | value 2 | > * +------------------+ > > Now, in this error, my understanding is that we are actually ending up > in a case where "here" ie the place where the new xattr entry will go is > beyond the "last" ie the last slot for xattr entry and that is causing > an underflow, something like the above diagram. > > My only concern is that why were we not able to detect this in the logic > near the start of the function where we explcity check if we have enough > space? > > Perhaps we should be fixing the logic in that if {..} instead > since the comment a few lines above your fix: > > /* No failures allowed past this point. */ > > does suggest that we can't error out below that point, so ideally all > the checks would have been done before that. > > I'm still going through the issue, will update here if needed. > > Regards, > ojaswin > Thank you for your suggestions. I will investigate this issue further. If there are any updates, I will let you know. -- Best, Qianqiang Liu
© 2016 - 2024 Red Hat, Inc.