fs/namespace.c: fix mountpath handling in do_lock_mount()

[PATCH] fs/namespace.c: fix mountpath handling in do_lock_mount()

Posted by Ryan Chung 5 months, 3 weeks ago

Updates documentation for do_lock_mount() in fs/namespace.c
to clarify its parameters and return description to fix
warning reported by syzbot.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202506301911.uysRaP8b-lkp@intel.com/
Signed-off-by: Ryan Chung <seokwoo.chung130@gmail.com>
---
 fs/namespace.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ddfd4457d338..577fdff9f1a8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2741,6 +2741,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
 /**
  * do_lock_mount - lock mount and mountpoint
  * @path:    target path
+ * @pinned: on success, holds a pin guarding the mountpoint
  * @beneath: whether the intention is to mount beneath @path
  *
  * Follow the mount stack on @path until the top mount @mnt is found. If
@@ -2769,8 +2770,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
  * to @mnt->mnt_mp->m_dentry. But if @mnt has been unmounted it will
  * point to @mnt->mnt_root and @mnt->mnt_mp will be NULL.
  *
- * Return: Either the target mountpoint on the top mount or the top
- *         mount's mountpoint.
+ * Return: On success, 0 is returned. On failure, err is returned.
  */
 static int do_lock_mount(struct path *path, struct pinned_mountpoint *pinned, bool beneath)
 {
-- 
2.43.0

[RFC] {do_,}lock_mount() behaviour wrt races and move_mount(2) with empty to_path (was Re: [PATCH] fs/namespace.c: fix mountpath handling in do_lock_mount())

Posted by Al Viro 5 months, 3 weeks ago

On Tue, Aug 19, 2025 at 02:22:35AM +0900, Ryan Chung wrote:
> Updates documentation for do_lock_mount() in fs/namespace.c
> to clarify its parameters and return description to fix
> warning reported by syzbot.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202506301911.uysRaP8b-lkp@intel.com/
> Signed-off-by: Ryan Chung <seokwoo.chung130@gmail.com>
> ---
>  fs/namespace.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index ddfd4457d338..577fdff9f1a8 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -2741,6 +2741,7 @@ static int attach_recursive_mnt(struct mount *source_mnt,
>  /**
>   * do_lock_mount - lock mount and mountpoint
>   * @path:    target path
> + * @pinned: on success, holds a pin guarding the mountpoint

I'm not sure if 'pin' is suitable here and in any case, that's not the
only problem in that description - take a look at "Return:" part in there.

The underlying problem is the semantics of function itself.  lock_mount()
assumed that it was called on the result of pathname resolution; the
question is what to do if we race with somebody mounting something
on top of the same location while we had been grabbing namespace_sem?
"Follow through to the root of whatever's been mounted on top, same as
we'd done if pathname resolution happened slightly later" used to be a
reasonable answer, but these days we have move_mount(2), where we have
	* MOVE_MOUNT_T_EMPTY_PATH combined with empty pathname, which
will have us start with whatever the descriptor is pointing to, mounts
or no mounts.  Choosing to treat that as "follow mounts anyway" is not
a big deal.
	* MOVE_MOUNT_BENEATH - treated as "follow mounts and slip the
damn thing under the topmost one".  Again, OK for non-empty pathname,
but... for empty ones the rationale is weaker.

Alternative would be to treat these races as "act as if we'd won and
the other guy had overmounted ours", i.e. *NOT* follow mounts.  Again,
for old syscalls that's fine - if another thread has raced with us and
mounted something on top of the place we want to mount on, it could just
as easily have come *after* we'd completed mount(2) and mounted their
stuff on top of ours.  If userland is not fine with such outcome, it needs
to provide serialization between the callers.  For move_mount(2)... again,
the only real question is empty to_path case.

Comments?

Note, BTW, that attach_recursive_mnt() used to require dest_mnt/dest_mp
to be on the very top; since 6.16 it treats that as "slip it under
whatever's on top of that" - that's exactly what happens in 'beneath'
case.  So the second alternative is easily doable these days.  And
it would really simplify the lock_mount()/do_lock_mount()...

Re: [RFC] {do_,}lock_mount() behaviour wrt races and move_mount(2) with empty to_path (was Re: [PATCH] fs/namespace.c: fix mountpath handling in do_lock_mount())

Posted by Al Viro 5 months, 3 weeks ago

On Mon, Aug 18, 2025 at 09:14:28PM +0100, Al Viro wrote:

> Alternative would be to treat these races as "act as if we'd won and
> the other guy had overmounted ours", i.e. *NOT* follow mounts.  Again,
> for old syscalls that's fine - if another thread has raced with us and
> mounted something on top of the place we want to mount on, it could just
> as easily have come *after* we'd completed mount(2) and mounted their
> stuff on top of ours.  If userland is not fine with such outcome, it needs
> to provide serialization between the callers.  For move_mount(2)... again,
> the only real question is empty to_path case.
> 
> Comments?

Thinking about it a bit more...  Unfortunately, there's another corner
case: "." as mountpoint.  That would affect that old syscalls as well
and I'm not sure that there's no userland code that relies upon the
current behaviour.

Background: pathname resolution does *NOT* follow mounts on the starting
point and it does not follow mounts after "."

; mkdir /tmp/foo
; mount -t tmpfs none /tmp/foo
; cd /tmp/foo
; echo under > a
; cat /tmp/foo/a
under
; mount -t tmpfs none /tmp/foo
; cat a
under
; cat /tmp/foo/a
cat: /tmp/foo/a: no such file or directory
; echo under > b
; cat b
under
; cat /tmp/foo/b
cat: /tmp/foo/b: no such file or directory
;

It's been a bad decision (if it can be called that - it's been more
of an accident, AFAICT), but it's decades too late to change it.
And interaction with mount is also fun: mount(2) *DOES* follow mounts
on the end of any pathname, no matter what.  So in case when we are
standing in an overmounted directory, ls . will show the contents of
that directory, but mount <something> . will mount on top of whatever's
mounted there.

So the alternative I've mentioned above would change the behaviour of
old syscalls in a corner case that just might be actually used in userland
code - including the scripts run at the boot time, of all things ;-/

IOW, it probably falls under "can't touch that, no matter how much we'd
like to" ;-/  Pity, that...

That leaves the question of MOVE_MOUNT_BENEATH with empty pathname -
do we want a variant that would say "slide precisely under the opened
directory I gave you, no matter what might overmount it"?

At the very least this corner case needs to be documented in move_mount(2)
- behaviour of
	move_mount(_, _, dir_fd, "",
		   MOVE_MOUNT_T_EMPTY | MOVE_MOUNT_BENEATH)
has two apriori reasonable variants ("slide right under the top of
whatever pile there might be over dir_fd" and "slide right under dir_fd
itself, no matter what pile might be on top of that") and leaving it
unspecified is not good, IMO...

Re: [RFC] {do_,}lock_mount() behaviour wrt races and move_mount(2) with empty to_path (was Re: [PATCH] fs/namespace.c: fix mountpath handling in do_lock_mount())

Posted by Christian Brauner 5 months, 3 weeks ago

On Mon, Aug 18, 2025 at 09:56:06PM +0100, Al Viro wrote:
> On Mon, Aug 18, 2025 at 09:14:28PM +0100, Al Viro wrote:
> 
> > Alternative would be to treat these races as "act as if we'd won and
> > the other guy had overmounted ours", i.e. *NOT* follow mounts.  Again,
> > for old syscalls that's fine - if another thread has raced with us and
> > mounted something on top of the place we want to mount on, it could just
> > as easily have come *after* we'd completed mount(2) and mounted their
> > stuff on top of ours.  If userland is not fine with such outcome, it needs
> > to provide serialization between the callers.  For move_mount(2)... again,
> > the only real question is empty to_path case.
> > 
> > Comments?
> 
> Thinking about it a bit more...  Unfortunately, there's another corner
> case: "." as mountpoint.  That would affect that old syscalls as well
> and I'm not sure that there's no userland code that relies upon the
> current behaviour.
> 
> Background: pathname resolution does *NOT* follow mounts on the starting
> point and it does not follow mounts after "."
> 
> ; mkdir /tmp/foo
> ; mount -t tmpfs none /tmp/foo
> ; cd /tmp/foo
> ; echo under > a
> ; cat /tmp/foo/a
> under
> ; mount -t tmpfs none /tmp/foo
> ; cat a
> under
> ; cat /tmp/foo/a
> cat: /tmp/foo/a: no such file or directory
> ; echo under > b
> ; cat b
> under
> ; cat /tmp/foo/b
> cat: /tmp/foo/b: no such file or directory
> ;
> 
> It's been a bad decision (if it can be called that - it's been more
> of an accident, AFAICT), but it's decades too late to change it.
> And interaction with mount is also fun: mount(2) *DOES* follow mounts
> on the end of any pathname, no matter what.  So in case when we are
> standing in an overmounted directory, ls . will show the contents of
> that directory, but mount <something> . will mount on top of whatever's
> mounted there.
> 
> So the alternative I've mentioned above would change the behaviour of
> old syscalls in a corner case that just might be actually used in userland
> code - including the scripts run at the boot time, of all things ;-/
> 
> IOW, it probably falls under "can't touch that, no matter how much we'd
> like to" ;-/  Pity, that...
> 
> That leaves the question of MOVE_MOUNT_BENEATH with empty pathname -
> do we want a variant that would say "slide precisely under the opened
> directory I gave you, no matter what might overmount it"?

Afaict, right now MOVE_MOUNT_BENEATH will take the overmount into
account even for "." just like mount(2) will lookup the topmost mount no
matter what. That is what userspace expects. I don't think we need a
variant where "." ignores overmounts for MOVE_MOUNT_BENEATH and really
not unless someone has a specific use-case for it. If it comes to that
we should probably add a new flag.

> 
> At the very least this corner case needs to be documented in move_mount(2)
> - behaviour of
> 	move_mount(_, _, dir_fd, "",
> 		   MOVE_MOUNT_T_EMPTY | MOVE_MOUNT_BENEATH)
> has two apriori reasonable variants ("slide right under the top of
> whatever pile there might be over dir_fd" and "slide right under dir_fd

Yes, that's what's intended and documented also what I wrote in my
commit messages and what the selftests should test for. I specifically
did not make it deviate from standard mount(2) behavior.

> itself, no matter what pile might be on top of that") and leaving it
> unspecified is not good, IMO...

Sure, Aleksa can pull that into his documentation patches.

Re: [RFC] {do_,}lock_mount() behaviour wrt races and move_mount(2) with empty to_path (was Re: [PATCH] fs/namespace.c: fix mountpath handling in do_lock_mount())

Posted by Ryan Chung 4 months, 2 weeks ago

On Tue, Aug 19, 2025 at 11:40:14AM +0200, Christian Brauner wrote:
> On Mon, Aug 18, 2025 at 09:56:06PM +0100, Al Viro wrote:
> > On Mon, Aug 18, 2025 at 09:14:28PM +0100, Al Viro wrote:
> > 
> > > Alternative would be to treat these races as "act as if we'd won and
> > > the other guy had overmounted ours", i.e. *NOT* follow mounts.  Again,
> > > for old syscalls that's fine - if another thread has raced with us and
> > > mounted something on top of the place we want to mount on, it could just
> > > as easily have come *after* we'd completed mount(2) and mounted their
> > > stuff on top of ours.  If userland is not fine with such outcome, it needs
> > > to provide serialization between the callers.  For move_mount(2)... again,
> > > the only real question is empty to_path case.
> > > 
> > > Comments?
> > 
> > Thinking about it a bit more...  Unfortunately, there's another corner
> > case: "." as mountpoint.  That would affect that old syscalls as well
> > and I'm not sure that there's no userland code that relies upon the
> > current behaviour.
> > 
> > Background: pathname resolution does *NOT* follow mounts on the starting
> > point and it does not follow mounts after "."
> > 
> > ; mkdir /tmp/foo
> > ; mount -t tmpfs none /tmp/foo
> > ; cd /tmp/foo
> > ; echo under > a
> > ; cat /tmp/foo/a
> > under
> > ; mount -t tmpfs none /tmp/foo
> > ; cat a
> > under
> > ; cat /tmp/foo/a
> > cat: /tmp/foo/a: no such file or directory
> > ; echo under > b
> > ; cat b
> > under
> > ; cat /tmp/foo/b
> > cat: /tmp/foo/b: no such file or directory
> > ;
> > 
> > It's been a bad decision (if it can be called that - it's been more
> > of an accident, AFAICT), but it's decades too late to change it.
> > And interaction with mount is also fun: mount(2) *DOES* follow mounts
> > on the end of any pathname, no matter what.  So in case when we are
> > standing in an overmounted directory, ls . will show the contents of
> > that directory, but mount <something> . will mount on top of whatever's
> > mounted there.
> > 
> > So the alternative I've mentioned above would change the behaviour of
> > old syscalls in a corner case that just might be actually used in userland
> > code - including the scripts run at the boot time, of all things ;-/
> > 
> > IOW, it probably falls under "can't touch that, no matter how much we'd
> > like to" ;-/  Pity, that...
> > 
> > That leaves the question of MOVE_MOUNT_BENEATH with empty pathname -
> > do we want a variant that would say "slide precisely under the opened
> > directory I gave you, no matter what might overmount it"?
> 
> Afaict, right now MOVE_MOUNT_BENEATH will take the overmount into
> account even for "." just like mount(2) will lookup the topmost mount no
> matter what. That is what userspace expects. I don't think we need a
> variant where "." ignores overmounts for MOVE_MOUNT_BENEATH and really
> not unless someone has a specific use-case for it. If it comes to that
> we should probably add a new flag.
> 
> > 
> > At the very least this corner case needs to be documented in move_mount(2)
> > - behaviour of
> > 	move_mount(_, _, dir_fd, "",
> > 		   MOVE_MOUNT_T_EMPTY | MOVE_MOUNT_BENEATH)
> > has two apriori reasonable variants ("slide right under the top of
> > whatever pile there might be over dir_fd" and "slide right under dir_fd
> 
> Yes, that's what's intended and documented also what I wrote in my
> commit messages and what the selftests should test for. I specifically
> did not make it deviate from standard mount(2) behavior.
> 
> > itself, no matter what pile might be on top of that") and leaving it
> > unspecified is not good, IMO...
> 
> Sure, Aleksa can pull that into his documentation patches.

Hello all,

I am writing to follow up on this RFC patch. The last discussion was a
month ago and it seems like the conversation has stalled.

Thank you.

Best regards,
Ryan Chung