[PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API

Aleksa Sarai posted 12 patches 4 months, 1 week ago
There is a newer version of this series
Posted by Aleksa Sarai 4 months, 1 week ago
This is loosely based on the original documentation written by David
Howells and later maintained by Christian Brauner, but has been
rewritten to be more from a user perspective (as well as fixing a few
critical mistakes).

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man/man2/fsmount.2 | 220 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 220 insertions(+)

diff --git a/man/man2/fsmount.2 b/man/man2/fsmount.2
new file mode 100644
index 0000000000000000000000000000000000000000..92331cb18272f9ac836e55e7f28faea3a3efbdac
--- /dev/null
+++ b/man/man2/fsmount.2
@@ -0,0 +1,220 @@
+.\" Copyright, the authors of the Linux man-pages project
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH fsmount 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+fsmount \- instantiate mount object from filesystem context
+.SH LIBRARY
+Standard C library
+.RI ( libc ,\~ \-lc )
+.SH SYNOPSIS
+.nf
+.B #include <sys/mount.h>
+.P
+.BI "int fsmount(int " fsfd ", unsigned int " flags ", \
+unsigned int " attr_flags ");"
+.fi
+.SH DESCRIPTION
+The
+.BR fsmount ()
+system call is part of
+the suite of file descriptor based mount facilities in Linux.
+.P
+.BR fsmount ()
+creates a new detached mount object
+for the root of the new filesystem instance
+referenced by the filesystem context file descriptor
+.IR fsfd .
+A new file descriptor
+associated with the detached mount object
+is then returned.
+In order to create a mount object with
+.BR fsmount (),
+the calling process must have the
+.B \%CAP_SYS_ADMIN
+capability.
+.P
+The filesystem context must have been created with a call to
+.BR fsopen (2)
+and then had a filesystem instance instantiated with a call to
+.BR fsconfig (2)
+with
+.B \%FSCONFIG_CMD_CREATE
+or
+.B \%FSCONFIG_CMD_CREATE_EXCL
+in order to be in the correct state
+for this operation
+(the "awaiting-mount" mode in kernel-developer parlance).
+.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
+Unlike
+.BR open_tree (2)
+with
+.BR \%OPEN_TREE_CLONE ,
+.BR fsmount ()
+can only be called once
+in the lifetime of a filesystem instance
+to produce a mount object.
+.P
+As with file descriptors returned from
+.BR open_tree (2)
+called with
+.BR OPEN_TREE_CLONE ,
+the returned file descriptor
+can then be used with
+.BR move_mount (2),
+.BR mount_setattr (2),
+or other such system calls to do further mount operations.
+This mount object will be unmounted and destroyed
+when the file descriptor is closed
+if it was not otherwise attached to a mount point
+by calling
+.BR move_mount (2).
+The returned file descriptor
+also acts the same as one produced by
+.BR open (2)
+with
+.BR O_PATH ,
+meaning it can also be used as a
+.I dirfd
+argument
+to "*at()" system calls.
+.P
+.I flags
+controls the creation of the returned file descriptor.
+A value for
+.I flags
+is constructed by bitwise ORing
+zero or more of the following constants:
+.RS
+.TP
+.B FSMOUNT_CLOEXEC
+Set the close-on-exec
+.RB ( FD_CLOEXEC )
+flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.RE
+.P
+.I attr_flags
+specifies mount attributes
+which will be applied to the created mount object,
+in the form of
+.BI \%MOUNT_ATTR_ *
+flags.
+The flags are interpreted as though
+.BR mount_setattr (2)
+was called with
+.I attr.attr_set
+set to the same value as
+.IR attr_flags .
+.BI \%MOUNT_ATTR_ *
+flags which would require
+specifying additional fields in
+.BR mount_attr (2type)
+(such as
+.BR \%MOUNT_ATTR_IDMAP )
+are not valid flag values for
+.IR attr_flags .
+.P
+If the
+.BR fsmount ()
+operation is successful,
+the filesystem context
+associated with the file descriptor
+.I fsfd
+is reset
+and placed into reconfiguration mode,
+as if it were just returned by
+.BR fspick (2).
+You may continue to use
+.BR fsconfig (2)
+with the now-reset filesystem context,
+including issuing the
+.B \%FSCONFIG_CMD_RECONFIGURE
+command
+to reconfigure the filesystem instance.
+.SH RETURN VALUE
+On success, a new file descriptor is returned.
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBUSY
+The filesystem context associated with
+.I fsfd
+is not in the right state
+to be used by
+.BR fsmount ().
+.TP
+.B EINVAL
+.I flags
+had an invalid flag set.
+.TP
+.B EINVAL
+.I attr_flags
+had an invalid
+.BI MOUNT_ATTR_ *
+flag set.
+.TP
+.B EMFILE
+The calling process has too many open files to create more.
+.TP
+.B ENFILE
+The system has too many open files to create more.
+.TP
+.B ENOSPC
+The "anonymous" mount namespace
+necessary to contain the new mount object
+could not be allocated,
+as doing so would exceed
+the configured per-user limit on
+the number of mount namespaces in the current user namespace.
+(See also
+.BR namespaces (7).)
+.TP
+.B ENOMEM
+The kernel could not allocate sufficient memory to complete the operation.
+.TP
+.B EPERM
+The calling process does not have the required
+.B CAP_SYS_ADMIN
+capability.
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 5.2.
+.\" commit 93766fbd2696c2c4453dd8e1070977e9cd4e6b6d
+.\" commit 400913252d09f9cfb8cce33daee43167921fc343
+glibc 2.36.
+.SH EXAMPLES
+.in +4n
+.EX
+int fsfd, mntfd, tmpfd;
+\&
+fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NODEV | MOUNT_ATTR_NOEXEC);
+\&
+/* Create a new file without attaching the mount object. */
+tmpfd = openat(mntfd, "tmpfile", O_CREAT | O_EXCL | O_RDWR, 0600);
+unlinkat(mntfd, "tmpfile", 0);
+\&
+/* Attach the mount object to "/tmp". */
+move_mount(mntfd, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH);
+.EE
+.in
+.SH SEE ALSO
+.BR fsconfig (2),
+.BR fsopen (2),
+.BR fspick (2),
+.BR mount (2),
+.BR mount_setattr (2),
+.BR move_mount (2),
+.BR open_tree (2),
+.BR mount_namespaces (7)
+

-- 
2.50.1
Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
Posted by Askar Safin 4 months ago
fsmount:
> Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object.

I don't understand what you meant here. This phrase in its current form is wrong.
Consider this scenario: we did this:
fsopen(...)
fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
fsconfig(..., FSCONFIG_CMD_CREATE, ...)
fsmount(...)
fsopen(...)
fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
fsconfig(..., FSCONFIG_CMD_CREATE, ...)
fsmount(...)

We used FSCONFIG_CMD_CREATE here as opposed to FSCONFIG_CMD_CREATE_EXCL, thus
it is possible that second fsmount will return mount for the same superblock.
Thus that statement "fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object"
is not true.
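
For concreteness, the scenario above could be sketched roughly as follows.
This is only an illustration: mount_from_new_context() is a hypothetical
helper name, and the sketch calls syscall(2) directly (with constants from
<linux/mount.h>) on the assumption that this also builds against glibc older
than 2.36; with newer glibc the wrappers from <sys/mount.h> could be used
instead.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/mount.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): create a fresh filesystem
 * context, set "source", create the filesystem instance, and call
 * fsmount() once. Returns a detached mount fd, or -1 with errno set.
 */
static int mount_from_new_context(const char *fstype, const char *source)
{
	int fsfd, mntfd = -1, saved_errno;

	fsfd = (int)syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
	if (fsfd < 0)
		return -1;

	/* FSCONFIG_CMD_CREATE (not _EXCL), as in the scenario above. */
	if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING, "source", source, 0) == 0 &&
	    syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) == 0)
		mntfd = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);

	/* Drop the context; preserve errno from whichever call failed. */
	saved_errno = errno;
	close(fsfd);
	errno = saved_errno;
	return mntfd;
}
```

Calling the helper twice with the same filesystem type and source (by a
caller with CAP_SYS_ADMIN) runs the sequence twice. Each call uses its own
filesystem context, so both fsmount() calls can succeed, and the two mount
objects may end up backed by the same superblock.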

--
Askar Safin
https://types.pl/@safinaskar
Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
Posted by Aleksa Sarai 4 months ago
On 2025-08-12, Askar Safin <safinaskar@zohomail.com> wrote:
> fsmount:
> > Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object.
>
> I don't understand what you meant here. This phrase in its current form is wrong.
> Consider this scenario: we did this:
> fsopen(...)
> fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
> fsconfig(..., FSCONFIG_CMD_CREATE, ...)
> fsmount(...)
> fsopen(...)
> fsconfig(..., FSCONFIG_SET_STRING, "source", ...)
> fsconfig(..., FSCONFIG_CMD_CREATE, ...)
> fsmount(...)
> 
> We used FSCONFIG_CMD_CREATE here as opposed to FSCONFIG_CMD_CREATE_EXCL, thus
> it is possible that second fsmount will return mount for the same superblock.
> Thus that statement "fsmount() can only be called once in the lifetime of a filesystem instance to produce a mount object"
> is not true.

Yeah, the superblock reuse behaviour makes this description less
coherent than what I was going for. My thinking was that a reused
superblock is (to userspace) conceptually a new filesystem instance
because they create it the same way as any other filesystem instance.
(In fact, the rest of the VFS treats them the same way too -- only
sget_fc() knows about superblock reuse.)

But yeah, "filesystem context" is more accurate here, so probably just:

  Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
  once in the lifetime of a filesystem context.

Though maybe we should mention that it's fsopen(2)-only (even though
it's mentioned earlier in the DESCRIPTION)? If you read the sentence in
isolation you might get the wrong impression. Do you have any
alternative suggestions?

FWIW, superblock reuse is one of those things that is a fairly hairy
implementation detail of the VFS, and as such it has quite odd
semantics. I probably wouldn't have documented it as heavily if it
wasn't for the addition of FSCONFIG_CMD_CREATE_EXCL (maybe an entry in
BUGS or CAVEATS at most -- this behaviour has an even worse impact on
mount(2) but it's completely undocumented there).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
Posted by Askar Safin 3 months, 3 weeks ago
 ---- On Tue, 12 Aug 2025 18:33:04 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 >   Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
 >   once in the lifetime of a filesystem context.

Weird. open_tree doesn't get filesystem context as argument at all.
I suggest just this:

  fsmount() can only be called
  once in the lifetime of a filesystem context.

--
Askar Safin
https://types.pl/@safinaskar
Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
Posted by Aleksa Sarai 3 months, 3 weeks ago
On 2025-08-20, Askar Safin <safinaskar@zohomail.com> wrote:
>  ---- On Tue, 12 Aug 2025 18:33:04 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
>  >   Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
>  >   once in the lifetime of a filesystem context.
> 
> Weird. open_tree doesn't get filesystem context as argument at all.
> I suggest just this:
> 
>   fsmount() can only be called
>   once in the lifetime of a filesystem context.

The reason I wanted to include the comparison is that you can create
multiple mount objects from the same underlying object using
open_tree(2) but that's not possible with fsmount(2) (at least, not
without creating a new filesystem context each time).
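
A rough sketch of that contrast, for illustration only. clone_twice() and
fsmount_twice() are hypothetical helper names, and the sketch uses
syscall(2) directly on the assumption that this also builds against glibc
older than 2.36. The exact errno of a second fsmount() on the same context
is deliberately not assumed.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: clone the mount at `path` twice with open_tree().
 * Stores both fds (-1 on failure) in fds[] and returns how many calls
 * succeeded.
 */
static int clone_twice(const char *path, int fds[2])
{
	int ok = 0;

	for (int i = 0; i < 2; i++) {
		fds[i] = (int)syscall(SYS_open_tree, AT_FDCWD, path,
				      OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
		ok += (fds[i] >= 0);
	}
	return ok;
}

/*
 * Hypothetical helper: call fsmount() twice on the same filesystem
 * context. Stores both results (-1 on failure) in fds[] and returns how
 * many calls succeeded.
 */
static int fsmount_twice(int fsfd, int fds[2])
{
	int ok = 0;

	for (int i = 0; i < 2; i++) {
		fds[i] = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
		ok += (fds[i] >= 0);
	}
	return ok;
}
```

For a privileged caller, clone_twice() on an existing mount point can
return 2, whereas fsmount_twice() on a context that has been through
FSCONFIG_CMD_CREATE is expected to return 1 at most. The caller is
responsible for closing any fds that were returned.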

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/
Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
Posted by Askar Safin 3 months, 3 weeks ago
 ---- On Wed, 20 Aug 2025 14:38:48 +0400  Aleksa Sarai <cyphar@cyphar.com> wrote --- 
 > The reason I wanted to include the comparison is that you can create
 > multiple mount objects from the same underlying object using
 > open_tree(2) but that's not possible with fsmount(2) (at least, not
 > without creating a new filesystem context each time).

Okay, you may write that.

--
Askar Safin
https://types.pl/@safinaskar
Re: [PATCH v3 07/12] man/man2/fsmount.2: document "new" mount API
Posted by Aleksa Sarai 4 months ago
On 2025-08-13, Aleksa Sarai <cyphar@cyphar.com> wrote:
> But yeah, "filesystem context" is more accurate here, so probably just:

Oops, I meant to include

>   Unlike open_tree(2) with OPEN_TREE_CLONE, fsmount() can only be called
>   once in the lifetime of a filesystem context.
                                                 to create a mount object.

at the end.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/