fs: preserve file type in make_bad_inode() unless invalid

[PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Tetsuo Handa 2 months ago

syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
to S_IFREG. Since make_bad_inode() might be called after an inode is fully
constructed, make_bad_inode() should not needlessly change file type.

Reported-by: syzbot+d222f4b7129379c3d5bc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d222f4b7129379c3d5bc
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
Should we implement all callbacks (except get_offset_ctx callback which is
currently used by only tmpfs which does not call make_bad_inode()) within
bad_inode_ops, for there might be a callback which is expected to be non-NULL
for !S_IFREG types? Implementing missing callbacks is good for eliminating
possibility of NULL function pointer call. Since VFS is using

    if (!inode->i_op->foo)
        return error;
    inode->i_op->foo();

pattern instead of

    pFoo = READ_ONCE(inode->i_op->foo)
    if (!pFoo)
        return error;
    pFoo();

pattern, suddenly replacing "one i_op with i_op->foo != NULL" with "another
i_op with i_op->foo == NULL" has possibility of NULL pointer function call
(e.g. https://lkml.kernel.org/r/18a58415-4aa9-4cba-97d2-b70384407313@I-love.SAKURA.ne.jp ).
If we implement missing callbacks, e.g. vfs_fileattr_get() will start
calling security_inode_file_getattr() on bad inode, but we can eliminate
possibility of inode->i_op->fileattr_get == NULL when make_bad_inode() is
called from security_inode_file_getattr() for some reason.

 fs/bad_inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 0ef9bcb744dd..ff6c2daecd1c 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -207,7 +207,19 @@ void make_bad_inode(struct inode *inode)
 {
 	remove_inode_hash(inode);
 
-	inode->i_mode = S_IFREG;
+	switch (inode->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFIFO:
+	case S_IFSOCK:
+		inode->i_mode &= S_IFMT;
+		break;
+	default:
+		inode->i_mode = S_IFREG;
+	}
 	simple_inode_init_ts(inode);
 	inode->i_op = &bad_inode_ops;	
 	inode->i_opflags &= ~IOP_XATTR;
-- 
2.47.3

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Jan Kara 2 months ago

On Wed 10-12-25 18:45:26, Tetsuo Handa wrote:
> syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> constructed, make_bad_inode() should not needlessly change file type.
> 
> Reported-by: syzbot+d222f4b7129379c3d5bc@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=d222f4b7129379c3d5bc
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

No. make_bad_inode() must not be called once the inode is fully visible
because that can cause all sorts of fun. That function is really only good
for handling a situation when read of an inode from the disk failed or
similar early error paths. It would be great if make_bad_inode() could do
something like:

	VFS_BUG_ON_INODE(!(inode_state_read_once(inode) & I_NEW));

but sadly that is not currently possible because inodes start with i_state
set to 0 and some places do call make_bad_inode() before I_NEW is set in
i_state. Matheusz wanted to clean that up a bit AFAIK.

Until the cleanup is done, perhaps we could add:

	VFS_BUG_ON_INODE(inode->i_dentry->first);

to make_bad_inode() and watch the fireworks from syzbot. But at least the
bugs would be attributed to the place where they are happening.

								Honza

> ---
> Should we implement all callbacks (except get_offset_ctx callback which is
> currently used by only tmpfs which does not call make_bad_inode()) within
> bad_inode_ops, for there might be a callback which is expected to be non-NULL
> for !S_IFREG types? Implementing missing callbacks is good for eliminating
> possibility of NULL function pointer call. Since VFS is using
> 
>     if (!inode->i_op->foo)
>         return error;
>     inode->i_op->foo();
> 
> pattern instead of
> 
>     pFoo = READ_ONCE(inode->i_op->foo)
>     if (!pFoo)
>         return error;
>     pFoo();
> 
> pattern, suddenly replacing "one i_op with i_op->foo != NULL" with "another
> i_op with i_op->foo == NULL" has possibility of NULL pointer function call
> (e.g. https://lkml.kernel.org/r/18a58415-4aa9-4cba-97d2-b70384407313@I-love.SAKURA.ne.jp ).
> If we implement missing callbacks, e.g. vfs_fileattr_get() will start
> calling security_inode_file_getattr() on bad inode, but we can eliminate
> possibility of inode->i_op->fileattr_get == NULL when make_bad_inode() is
> called from security_inode_file_getattr() for some reason.
> 
>  fs/bad_inode.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/bad_inode.c b/fs/bad_inode.c
> index 0ef9bcb744dd..ff6c2daecd1c 100644
> --- a/fs/bad_inode.c
> +++ b/fs/bad_inode.c
> @@ -207,7 +207,19 @@ void make_bad_inode(struct inode *inode)
>  {
>  	remove_inode_hash(inode);
>  
> -	inode->i_mode = S_IFREG;
> +	switch (inode->i_mode & S_IFMT) {
> +	case S_IFREG:
> +	case S_IFDIR:
> +	case S_IFLNK:
> +	case S_IFCHR:
> +	case S_IFBLK:
> +	case S_IFIFO:
> +	case S_IFSOCK:
> +		inode->i_mode &= S_IFMT;
> +		break;
> +	default:
> +		inode->i_mode = S_IFREG;
> +	}
>  	simple_inode_init_ts(inode);
>  	inode->i_op = &bad_inode_ops;	
>  	inode->i_opflags &= ~IOP_XATTR;
> -- 
> 2.47.3
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Tetsuo Handa 1 month ago

On 2025/12/10 19:09, Jan Kara wrote:
> On Wed 10-12-25 18:45:26, Tetsuo Handa wrote:
>> syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
>> introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
>> handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
>> to S_IFREG. Since make_bad_inode() might be called after an inode is fully
>> constructed, make_bad_inode() should not needlessly change file type.
>>
>> Reported-by: syzbot+d222f4b7129379c3d5bc@syzkaller.appspotmail.com
>> Closes: https://syzkaller.appspot.com/bug?extid=d222f4b7129379c3d5bc
>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> No. make_bad_inode() must not be called once the inode is fully visible
> because that can cause all sorts of fun. That function is really only good
> for handling a situation when read of an inode from the disk failed or
> similar early error paths.
I'm surprised to hear that.

But since commit 58b6fcd2ab34 ("ocfs2: mark inode bad upon validation failure
during read") is a bug fix, we want to somehow prevent this bug from re-opening.

Minimal change for this release cycle might look like

----------
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index b5fcc2725a29..2c97c8b4013f 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1715,8 +1715,13 @@ int ocfs2_read_inode_block_full(struct inode *inode, struct buffer_head **bh,
 	rc = ocfs2_read_blocks(INODE_CACHE(inode), OCFS2_I(inode)->ip_blkno,
 			       1, &tmp, flags, ocfs2_validate_inode_block);
 
-	if (rc < 0)
+	if (rc < 0) {
+		/* Preserve file type while making operations no-op. */
+		umode_t	mode = inode->i_mode & S_IFMT;
+
 		make_bad_inode(inode);
+		inode->i_mode = mode;
+	}
 	/* If ocfs2_read_blocks() got us a new bh, pass it up. */
 	if (!rc && !*bh)
 		*bh = tmp;
----------

but what approach do you prefer?

Introduce a copy of bad_{inode,file}_ops for ocfs2 and replace
a call to make_bad_inode() with updating only {inode,file}_ops ?

Or, modify existing {inode,file}_ops for ocfs2 to check whether
an I/O error has occurred in the past?

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Jan Kara 1 month ago

On Tue 06-01-26 19:10:41, Tetsuo Handa wrote:
> On 2025/12/10 19:09, Jan Kara wrote:
> > On Wed 10-12-25 18:45:26, Tetsuo Handa wrote:
> >> syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> >> introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> >> handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> >> to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> >> constructed, make_bad_inode() should not needlessly change file type.
> >>
> >> Reported-by: syzbot+d222f4b7129379c3d5bc@syzkaller.appspotmail.com
> >> Closes: https://syzkaller.appspot.com/bug?extid=d222f4b7129379c3d5bc
> >> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > 
> > No. make_bad_inode() must not be called once the inode is fully visible
> > because that can cause all sorts of fun. That function is really only good
> > for handling a situation when read of an inode from the disk failed or
> > similar early error paths.
> I'm surprised to hear that.
> 
> But since commit 58b6fcd2ab34 ("ocfs2: mark inode bad upon validation
> failure during read") is a bug fix, we want to somehow prevent this bug
> from re-opening.

Since Jens has picked up
https://lore.kernel.org/all/20251217190040.490204-2-rpthibeault@gmail.com/
yesterday I suspect the original reproducer for OCFS2 will not cause issue
anymore even without 58b6fcd2ab34 because as far as I had a look the
original problem was caused by the loop device getting messed up under a
mounted OCFS2 filesystem. It would be good to verify my analysis is correct
but I think just reverting 58b6fcd2ab34 might be the best option at this
point.

								Honza

> 
> Minimal change for this release cycle might look like
> 
> ----------
> diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
> index b5fcc2725a29..2c97c8b4013f 100644
> --- a/fs/ocfs2/inode.c
> +++ b/fs/ocfs2/inode.c
> @@ -1715,8 +1715,13 @@ int ocfs2_read_inode_block_full(struct inode *inode, struct buffer_head **bh,
>  	rc = ocfs2_read_blocks(INODE_CACHE(inode), OCFS2_I(inode)->ip_blkno,
>  			       1, &tmp, flags, ocfs2_validate_inode_block);
>  
> -	if (rc < 0)
> +	if (rc < 0) {
> +		/* Preserve file type while making operations no-op. */
> +		umode_t	mode = inode->i_mode & S_IFMT;
> +
>  		make_bad_inode(inode);
> +		inode->i_mode = mode;
> +	}
>  	/* If ocfs2_read_blocks() got us a new bh, pass it up. */
>  	if (!rc && !*bh)
>  		*bh = tmp;
> ----------
> 
> but what approach do you prefer?
> 
> Introduce a copy of bad_{inode,file}_ops for ocfs2 and replace
> a call to make_bad_inode() with updating only {inode,file}_ops ?
> 
> Or, modify existing {inode,file}_ops for ocfs2 to check whether
> an I/O error has occurred in the past?
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Tetsuo Handa 1 month ago

On 2026/01/07 18:36, Jan Kara wrote:
> Since Jens has picked up
> https://lore.kernel.org/all/20251217190040.490204-2-rpthibeault@gmail.com/
> yesterday I suspect the original reproducer for OCFS2 will not cause issue
> anymore even without 58b6fcd2ab34 because as far as I had a look the
> original problem was caused by the loop device getting messed up under a
> mounted OCFS2 filesystem. It would be good to verify my analysis is correct
> but I think just reverting 58b6fcd2ab34 might be the best option at this
> point.

Well, that patch wants

  	err = mutex_lock_killable(&lo->lo_mutex);
- 	if (err)
+ 	if (err) {
+		if (!(mode & BLK_OPEN_EXCL))
+			bd_abort_claiming(bdev, loop_set_status);
  		return err;
+	}

change...

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Jan Kara 1 month ago

On Wed 07-01-26 18:53:40, Tetsuo Handa wrote:
> On 2026/01/07 18:36, Jan Kara wrote:
> > Since Jens has picked up
> > https://lore.kernel.org/all/20251217190040.490204-2-rpthibeault@gmail.com/
> > yesterday I suspect the original reproducer for OCFS2 will not cause issue
> > anymore even without 58b6fcd2ab34 because as far as I had a look the
> > original problem was caused by the loop device getting messed up under a
> > mounted OCFS2 filesystem. It would be good to verify my analysis is correct
> > but I think just reverting 58b6fcd2ab34 might be the best option at this
> > point.
> 
> Well, that patch wants
> 
>   	err = mutex_lock_killable(&lo->lo_mutex);
> - 	if (err)
> + 	if (err) {
> +		if (!(mode & BLK_OPEN_EXCL))
> +			bd_abort_claiming(bdev, loop_set_status);
>   		return err;
> +	}
> 
> change...

Ah, a good catch! Please send a fixup to Jens. Thanks!

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Mateusz Guzik 2 months ago

On Wed, Dec 10, 2025 at 11:09 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 10-12-25 18:45:26, Tetsuo Handa wrote:
> > syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> > introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> > handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> > to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> > constructed, make_bad_inode() should not needlessly change file type.
> >
> > Reported-by: syzbot+d222f4b7129379c3d5bc@syzkaller.appspotmail.com
> > Closes: https://syzkaller.appspot.com/bug?extid=d222f4b7129379c3d5bc
> > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>
> No. make_bad_inode() must not be called once the inode is fully visible
> because that can cause all sorts of fun. That function is really only good
> for handling a situation when read of an inode from the disk failed or
> similar early error paths. It would be great if make_bad_inode() could do
> something like:
>
>         VFS_BUG_ON_INODE(!(inode_state_read_once(inode) & I_NEW));
>
> but sadly that is not currently possible because inodes start with i_state
> set to 0 and some places do call make_bad_inode() before I_NEW is set in
> i_state. Matheusz wanted to clean that up a bit AFAIK.
>

[ most unfortunate timing, I just sent an e-mail with an assumption
that make_bad_inode() has to be callable after the inode+dentries got
published. :> ]

I'm delighted to see the call is considered bogus.

As for being able to assert on it, I noted the current flag handling
for lifecycle tracking is unhelpful.

Per your response, i_state == 0 is overloaded to mean the inode is
fully sorted out *and* that it is brand new.

Instead clear-cut indicators are needed to track where the inode is in
its lifecycle.

I proposed 2 ways: a dedicated enum or fucking around with flags.

Indeed the easiest stepping stone for the time being would be to push
up I_NEW to alloc_inode and assert on it in places which set the flag.
I'm going to cook it up.

> Until the cleanup is done, perhaps we could add:
>
>         VFS_BUG_ON_INODE(inode->i_dentry->first);
>
> to make_bad_inode() and watch the fireworks from syzbot. But at least the
> bugs would be attributed to the place where they are happening.
>

Note the assert which is currently tripping over is very much
necessary for correctness as the new routine skips checking the type
on its own.

Thus the issue needs to get solved for 6.19.

Trying to weed out all of the make_bad_inode callers is probably too
much for the release cycle.

So I stand by patching this up to a state where the lookup routine can
reliably check that this is what happened, should it find a non-dir
inode and doing a proper fix for the next merge window.

>                                                                 Honza
>
> > ---
> > Should we implement all callbacks (except get_offset_ctx callback which is
> > currently used by only tmpfs which does not call make_bad_inode()) within
> > bad_inode_ops, for there might be a callback which is expected to be non-NULL
> > for !S_IFREG types? Implementing missing callbacks is good for eliminating
> > possibility of NULL function pointer call. Since VFS is using
> >
> >     if (!inode->i_op->foo)
> >         return error;
> >     inode->i_op->foo();
> >
> > pattern instead of
> >
> >     pFoo = READ_ONCE(inode->i_op->foo)
> >     if (!pFoo)
> >         return error;
> >     pFoo();
> >
> > pattern, suddenly replacing "one i_op with i_op->foo != NULL" with "another
> > i_op with i_op->foo == NULL" has possibility of NULL pointer function call
> > (e.g. https://lkml.kernel.org/r/18a58415-4aa9-4cba-97d2-b70384407313@I-love.SAKURA.ne.jp ).
> > If we implement missing callbacks, e.g. vfs_fileattr_get() will start
> > calling security_inode_file_getattr() on bad inode, but we can eliminate
> > possibility of inode->i_op->fileattr_get == NULL when make_bad_inode() is
> > called from security_inode_file_getattr() for some reason.
> >
> >  fs/bad_inode.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/bad_inode.c b/fs/bad_inode.c
> > index 0ef9bcb744dd..ff6c2daecd1c 100644
> > --- a/fs/bad_inode.c
> > +++ b/fs/bad_inode.c
> > @@ -207,7 +207,19 @@ void make_bad_inode(struct inode *inode)
> >  {
> >       remove_inode_hash(inode);
> >
> > -     inode->i_mode = S_IFREG;
> > +     switch (inode->i_mode & S_IFMT) {
> > +     case S_IFREG:
> > +     case S_IFDIR:
> > +     case S_IFLNK:
> > +     case S_IFCHR:
> > +     case S_IFBLK:
> > +     case S_IFIFO:
> > +     case S_IFSOCK:
> > +             inode->i_mode &= S_IFMT;
> > +             break;
> > +     default:
> > +             inode->i_mode = S_IFREG;
> > +     }
> >       simple_inode_init_ts(inode);
> >       inode->i_op = &bad_inode_ops;
> >       inode->i_opflags &= ~IOP_XATTR;
> > --
> > 2.47.3
> >
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Al Viro 2 months ago

On Wed, Dec 10, 2025 at 11:24:40AM +0100, Mateusz Guzik wrote:

> I'm delighted to see the call is considered bogus.
> 
> As for being able to assert on it, I noted the current flag handling
> for lifecycle tracking is unhelpful.
> 
> Per your response, i_state == 0 is overloaded to mean the inode is
> fully sorted out *and* that it is brand new.
> 
> Instead clear-cut indicators are needed to track where the inode is in
> its lifecycle.
> 
> I proposed 2 ways: a dedicated enum or fucking around with flags.
> 
> Indeed the easiest stepping stone for the time being would be to push
> up I_NEW to alloc_inode and assert on it in places which set the flag.
> I'm going to cook it up.

You are misinterpreting what I_NEW is about - it is badly named, no
arguments here, but it's _not_ "inode is new".

It's "it's in inode hash, but if you find it on lookup, you'll need to wait -
it's not entirely set up".

A plenty of inodes never enter that state at all.  Hell, consider pipes.
Or sockets.  Or anything on procfs.  Or sysfs, or...

We never look those up by inumber and there'd be no sane way to do that
anyway.  They never get hashed, nor should they.

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Mateusz Guzik 2 months ago

On Wed, Dec 10, 2025 at 10:13 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Dec 10, 2025 at 11:24:40AM +0100, Mateusz Guzik wrote:
>
> > I'm delighted to see the call is considered bogus.
> >
> > As for being able to assert on it, I noted the current flag handling
> > for lifecycle tracking is unhelpful.
> >
> > Per your response, i_state == 0 is overloaded to mean the inode is
> > fully sorted out *and* that it is brand new.
> >
> > Instead clear-cut indicators are needed to track where the inode is in
> > its lifecycle.
> >
> > I proposed 2 ways: a dedicated enum or fucking around with flags.
> >
> > Indeed the easiest stepping stone for the time being would be to push
> > up I_NEW to alloc_inode and assert on it in places which set the flag.
> > I'm going to cook it up.
>
> You are misinterpreting what I_NEW is about - it is badly named, no
> arguments here, but it's _not_ "inode is new".
>
> It's "it's in inode hash, but if you find it on lookup, you'll need to wait -
> it's not entirely set up".
>

Comments in the hash code make it pretty clear. The above is a part of
a bigger picture, which I already talked about in the 'light refcount'
patchset or whatever the name was.

The general problem statement is that the VFS layer suffers from a
chronic lack of assertions, which in turn helps people add latent bugs
(case in point: make_bad_inode() armed with asserts on state would
have blown up immediately and this entire thread would have been
avoided).

One of the things missing to create good coverage is reliable inode
lifecycle tracking. Currently *some of it* can be determined with some
of the flags in ->i_state, but even there are important states which
are straight up missing, notably whether the filesystem claims the
inode is ready to use *or* not ready at all (not even in the hash) is
indistinguishable by ->i_state. Trying to figure it out by other means
is avoidable tech debt. Bare minimum denoted states have to
distinguish between just allocated, creation aborted, ready to use, in
teardown and finally torn down.

An important part is validating whether inode at hand adheres to the
API contract when the filesystem claims it is ready. For example
->i_mode has to have a valid type set, but syzkaller convinced ntfs to
let an inode with invalid mode get out, which later resulted in a warn
in execve code because may_open() did not apply any of the checks to
it. This is the kind of a problem which can and should be checked for
before the inode is allowed to be used.

Another benefit is that some of the state can be pre-computed. For example this:
 static inline int do_inode_permission(struct mnt_idmap *idmap,
                                        struct inode *inode, int mask)
  {
          if (unlikely(!(inode->i_opflags & IOP_FASTPERM))) {
                  if (likely(inode->i_op->permission))
                          return inode->i_op->permission(idmap, inode,
mask);

                  /* This gets set once for the inode lifetime */
                  spin_lock(&inode->i_lock);
                  inode->i_opflags |= IOP_FASTPERM;
                  spin_unlock(&inode->i_lock);
          }
          return generic_permission(idmap, inode, mask);
  }

... will stop setting the flag as this aspect will be already sorted
out. There is another crapper of the sort in vfs_readlink and one can
suspect more will show up over time.

Back to lifecycle tracking, I_NEW could change semantics to mean "this
inode *is not* ready for use yet". unlock_new_inode() already serves
as a "this inode *is* ready for use" indicator, it just happens to not
be mandatory to call. Another routine could be added for filesystems
which don't use the hash to cover that gap.

Then for filesystems which *do* use the hash the entire thing is transparent.

> A plenty of inodes never enter that state at all.  Hell, consider pipes.
> Or sockets.  Or anything on procfs.  Or sysfs, or...
>

So whether I change the meaning of I_NEW or add another flag, I will
still need to patch these suckers to do *something* on the inodes they
create.

That's not the whole story, but should be enough to convey what I'm gunning for.

This will also have a side effect of giving I_NEW a more fitting use.

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Mateusz Guzik 2 months ago

On Wed, Dec 10, 2025 at 10:45 AM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> constructed, make_bad_inode() should not needlessly change file type.
>

ouch

So let's say calls to make_bad_inode *after* d_instantiate are unavoidable.

While screwing around with inode type for bogus inodes has merit,
switching things up from under dentry is bogus in its own right and
the choice of ISREG is questionable at best -- as is the call
introduces internally inconsistent state, in the case of the original
report a dentry claiming it's a directory pointing to a regular inode.

The non-screwed way of handling this would introduce a known BAD inode
type and patch up dentries as needed as well, but that might be a lot
of churn for not that much benefit.

As is, I don't know if leaving the type unchanged is safe -- there
might be nasty cases which depend on the change.

At the same time I claim the assert I introduced is mandatory.

Absent thorough audit, I think the pragmatic way forward for the time
being is to give code a chance to assert correctness and adjust the
assert in lookup code accordingly.

Right now that's not possible since the reassignment of type happens
without the inode spinlock held. Since make_bad_inode() calls into
remove_inode_hash which takes the spinlock, it should be safe to also
take it around the reassignment.

Then code wishing to assert on type race-free can take the spinlock to
stabilize it.

> Reported-by: syzbot+d222f4b7129379c3d5bc@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=d222f4b7129379c3d5bc
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
> Should we implement all callbacks (except get_offset_ctx callback which is
> currently used by only tmpfs which does not call make_bad_inode()) within
> bad_inode_ops, for there might be a callback which is expected to be non-NULL
> for !S_IFREG types? Implementing missing callbacks is good for eliminating
> possibility of NULL function pointer call. Since VFS is using
>
>     if (!inode->i_op->foo)
>         return error;
>     inode->i_op->foo();
>
> pattern instead of
>
>     pFoo = READ_ONCE(inode->i_op->foo)
>     if (!pFoo)
>         return error;
>     pFoo();
>
> pattern, suddenly replacing "one i_op with i_op->foo != NULL" with "another
> i_op with i_op->foo == NULL" has possibility of NULL pointer function call
> (e.g. https://lkml.kernel.org/r/18a58415-4aa9-4cba-97d2-b70384407313@I-love.SAKURA.ne.jp ).
> If we implement missing callbacks, e.g. vfs_fileattr_get() will start
> calling security_inode_file_getattr() on bad inode, but we can eliminate
> possibility of inode->i_op->fileattr_get == NULL when make_bad_inode() is
> called from security_inode_file_getattr() for some reason.
>
>  fs/bad_inode.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/fs/bad_inode.c b/fs/bad_inode.c
> index 0ef9bcb744dd..ff6c2daecd1c 100644
> --- a/fs/bad_inode.c
> +++ b/fs/bad_inode.c
> @@ -207,7 +207,19 @@ void make_bad_inode(struct inode *inode)
>  {
>         remove_inode_hash(inode);
>
> -       inode->i_mode = S_IFREG;
> +       switch (inode->i_mode & S_IFMT) {
> +       case S_IFREG:
> +       case S_IFDIR:
> +       case S_IFLNK:
> +       case S_IFCHR:
> +       case S_IFBLK:
> +       case S_IFIFO:
> +       case S_IFSOCK:
> +               inode->i_mode &= S_IFMT;
> +               break;
> +       default:
> +               inode->i_mode = S_IFREG;
> +       }
>         simple_inode_init_ts(inode);
>         inode->i_op = &bad_inode_ops;
>         inode->i_opflags &= ~IOP_XATTR;
> --
> 2.47.3
>
>

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Al Viro 2 months ago

On Wed, Dec 10, 2025 at 11:09:24AM +0100, Mateusz Guzik wrote:
> On Wed, Dec 10, 2025 at 10:45 AM Tetsuo Handa
> <penguin-kernel@i-love.sakura.ne.jp> wrote:
> >
> > syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> > introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> > handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> > to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> > constructed, make_bad_inode() should not needlessly change file type.
> >
> 
> ouch
> 
> So let's say calls to make_bad_inode *after* d_instantiate are unavoidable.

... and each one is a bug.

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Al Viro 2 months ago

On Wed, Dec 10, 2025 at 03:35:31PM +0000, Al Viro wrote:
> On Wed, Dec 10, 2025 at 11:09:24AM +0100, Mateusz Guzik wrote:
> > On Wed, Dec 10, 2025 at 10:45 AM Tetsuo Handa
> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > >
> > > syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> > > introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> > > handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> > > to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> > > constructed, make_bad_inode() should not needlessly change file type.
> > >
> > 
> > ouch
> > 
> > So let's say calls to make_bad_inode *after* d_instantiate are unavoidable.
> 
> ... and each one is a bug.

In this case I strongly suspect that it had been introduced in

commit 58b6fcd2ab34399258dc509f701d0986a8e0bcaa
Author: Ahmet Eray Karadag <eraykrdg1@gmail.com>
Date:   Tue Nov 18 03:18:34 2025 +0300
 
     ocfs2: mark inode bad upon validation failure during read

Folks, make_bad_inode() is *NOT* magic and having it anywhere in "hardening"
patch is a major red flag.  Please, don't do it, and I would recommend
reverting that commit, possibly along with the rest of the series.

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Al Viro 2 months ago

On Wed, Dec 10, 2025 at 03:35:31PM +0000, Al Viro wrote:
> On Wed, Dec 10, 2025 at 11:09:24AM +0100, Mateusz Guzik wrote:
> > On Wed, Dec 10, 2025 at 10:45 AM Tetsuo Handa
> > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > >
> > > syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> > > introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> > > handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> > > to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> > > constructed, make_bad_inode() should not needlessly change file type.
> > >
> > 
> > ouch
> > 
> > So let's say calls to make_bad_inode *after* d_instantiate are unavoidable.
> 
> ... and each one is a bug.

FWIW, I'm very tempted to fold make_bad_inode() into iget_failed().  Other
callers tend to be either pointless (e.g. ext2_new_inode() after reaching
fail: label - we only get there if inode has never reached inode hash
table; make_bad_inode() in there should've been gone for a long time)
or outright broken.

There's not a lot of callers, thankfully; I'm going through those at the
moment, but so far the impression is that we should be able to simply bury
the damn thing.

Re: [PATCH for 6.19-rc1] fs: preserve file type in make_bad_inode() unless invalid

Posted by Al Viro 2 months ago

On Wed, Dec 10, 2025 at 08:43:38PM +0000, Al Viro wrote:
> On Wed, Dec 10, 2025 at 03:35:31PM +0000, Al Viro wrote:
> > On Wed, Dec 10, 2025 at 11:09:24AM +0100, Mateusz Guzik wrote:
> > > On Wed, Dec 10, 2025 at 10:45 AM Tetsuo Handa
> > > <penguin-kernel@i-love.sakura.ne.jp> wrote:
> > > >
> > > > syzbot is hitting VFS_BUG_ON_INODE(!S_ISDIR(inode->i_mode)) check
> > > > introduced by commit e631df89cd5d ("fs: speed up path lookup with cheaper
> > > > handling of MAY_EXEC"), for make_bad_inode() is blindly changing file type
> > > > to S_IFREG. Since make_bad_inode() might be called after an inode is fully
> > > > constructed, make_bad_inode() should not needlessly change file type.
> > > >
> > > 
> > > ouch
> > > 
> > > So let's say calls to make_bad_inode *after* d_instantiate are unavoidable.
> > 
> > ... and each one is a bug.
> 
> FWIW, I'm very tempted to fold make_bad_inode() into iget_failed().  Other
> callers tend to be either pointless (e.g. ext2_new_inode() after reaching
> fail: label - we only get there if inode has never reached inode hash
> table; make_bad_inode() in there should've been gone for a long time)
> or outright broken.
> 
> There's not a lot of callers, thankfully; I'm going through those at the
> moment, but so far the impression is that we should be able to simply bury
> the damn thing.

While we are at it, 73861970938a "minixfs: Verify inode mode when loading from
disk" that introduced one of those is seriously misguided - sanity check belongs
in V1_minix_iget/V2_minix_iget, and should be handled there the same way we
deal with zero i_nlink.

We really ought to take that function out - as it is, it's an attractive
nuisance...