fs/namespace.c | 68 ++++++++++++++++++++++++++++++++++++++++++---- include/uapi/linux/mount.h | 6 +++- 2 files changed, 67 insertions(+), 7 deletions(-)
Meta has some internal logging that scrapes /proc/self/mountinfo today. I'd like to convert it to use listmount()/statmount(), so we can do a better job of monitoring with containers. We're missing some fields though. This patchset adds them. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- Changes in v4: - Rename mnt_devname to sb_source - Break statmount_string() behavior changes out into separate patch - Link to v3: https://lore.kernel.org/r/20241107-statmount-v3-0-da5b9744c121@kernel.org Changes in v3: - Unescape the result of ->show_devname - Move handling of nothing being emitted out of the switch statement - Link to v2: https://lore.kernel.org/r/20241106-statmount-v2-0-93ba2aad38d1@kernel.org Changes in v2: - make statmount_fs_subtype - return fast if no subtype is emitted - new patch to allow statmount() to return devicename - Link to v1: https://lore.kernel.org/r/20241106-statmount-v1-1-b93bafd97621@kernel.org --- Jeff Layton (3): fs: don't let statmount return empty strings fs: add the ability for statmount() to report the fs_subtype fs: add the ability for statmount() to report the sb_source fs/namespace.c | 68 ++++++++++++++++++++++++++++++++++++++++++---- include/uapi/linux/mount.h | 6 +++- 2 files changed, 67 insertions(+), 7 deletions(-) --- base-commit: 26213e1a6caa5a7f508b919059b0122b451f4dfe change-id: 20241106-statmount-3f91a7ed75fa Best regards, -- Jeff Layton <jlayton@kernel.org>
On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote: > Meta has some internal logging that scrapes /proc/self/mountinfo today. > I'd like to convert it to use listmount()/statmount(), so we can do a > better job of monitoring with containers. We're missing some fields > though. This patchset adds them. > > Applied to the vfs.misc branch of the vfs/vfs.git tree. Patches in the vfs.misc branch should appear in linux-next soon. Please report any outstanding bugs that were missed during review in a new review to the original patch series allowing us to drop it. It's encouraged to provide Acked-bys and Reviewed-bys even though the patch has now been applied. If possible patch trailers will be updated. Note that commit hashes shown below are subject to change due to rebase, trailer updates or similar. If in doubt, please check the listed branch. tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git branch: vfs.misc [1/3] fs: don't let statmount return empty strings https://git.kernel.org/vfs/vfs/c/75ead69a7173 [2/3] fs: add the ability for statmount() to report the fs_subtype https://git.kernel.org/vfs/vfs/c/ed9d95f691c2 [3/3] fs: add the ability for statmount() to report the sb_source https://git.kernel.org/vfs/vfs/c/d31cea0ab403
On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote: > > Meta has some internal logging that scrapes /proc/self/mountinfo today. > > I'd like to convert it to use listmount()/statmount(), so we can do a > > better job of monitoring with containers. We're missing some fields > > though. This patchset adds them. > > > > > > Applied to the vfs.misc branch of the vfs/vfs.git tree. > Patches in the vfs.misc branch should appear in linux-next soon. Jeff, thank you for this! I have already implemented support for statmount() and listmount() in libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The only remaining issue was the mount source and incomplete file system type. Currently, the library does not use these syscalls when mounting (it still uses /proc/#/mountinfo). However, there is already a library API to fetch mount information from the kernel using listmount+statmount, and it can be accessed from the command line using: findmnt --kernel=listmount Next on the wish list is a notification (a file descriptor that can be used in epoll) that returns a 64-bit ID when there is a change in the mount node. This will enable us to enhance systemd so that it does not have to read the entire mount table after every change. Karel -- Karel Zak <kzak@redhat.com> http://karelzak.blogspot.com
On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote: > > > Meta has some internal logging that scrapes /proc/self/mountinfo today. > > > I'd like to convert it to use listmount()/statmount(), so we can do a > > > better job of monitoring with containers. We're missing some fields > > > though. This patchset adds them. > > > > > > > > > > Applied to the vfs.misc branch of the vfs/vfs.git tree. > > Patches in the vfs.misc branch should appear in linux-next soon. > > Jeff, thank you for this! > > I have already implemented support for statmount() and listmount() in > libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The > only remaining issue was the mount source and incomplete file system > type. > Unfortunately, I think we might be missing something else: The mountinfo (and "mounts") file generator calls show_sb_opts() which generates some strings from the sb->s_flags field and then calls security_sb_show_options(). statmount() will give you the s_flags field (or an equivalent), but it doesn't give you the security options string. So, those aren't currently visible from statmount(). How should we expose those? Should we create a new statmount string field and populate it, or is it better to just tack them onto the end of the statmount.mnt_opts string? > Currently, the library does not use these syscalls when mounting (it > still uses /proc/#/mountinfo). However, there is already a library API > to fetch mount information from the kernel using listmount+statmount, > and it can be accessed from the command line using: > > findmnt --kernel=listmount > Very cool. I need to play with findmnt more. > Next on the wish list is a notification (a file descriptor that can be > used in epoll) that returns a 64-bit ID when there is a change in the > mount node. This will enable us to enhance systemd so that it does not > have to read the entire mount table after every change. > New fanotify events for mount table changes, perhaps? -- Jeff Layton <jlayton@kernel.org>
On Wed, Nov 13, 2024 at 08:45:06AM -0500, Jeff Layton wrote: > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote: > > > > Meta has some internal logging that scrapes /proc/self/mountinfo today. > > > > I'd like to convert it to use listmount()/statmount(), so we can do a > > > > better job of monitoring with containers. We're missing some fields > > > > though. This patchset adds them. > > > > > > > > > > > > > > Applied to the vfs.misc branch of the vfs/vfs.git tree. > > > Patches in the vfs.misc branch should appear in linux-next soon. > > > > Jeff, thank you for this! > > > > I have already implemented support for statmount() and listmount() in > > libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The > > only remaining issue was the mount source and incomplete file system > > type. > > > > Unfortunately, I think we might be missing something else: > > The mountinfo (and "mounts") file generator calls show_sb_opts() which > generates some strings from the sb->s_flags field and then calls > security_sb_show_options(). statmount() will give you the s_flags field > (or an equivalent), but it doesn't give you the security options > string. So, those aren't currently visible from statmount(). > > How should we expose those? Should we create a new statmount string > field and populate it, or is it better to just tack them onto the end > of the statmount.mnt_opts string? I'm leaning towards using a separate field because mnt_opts/opts_array is about filesystem specific mount options whereas the security mount options are somewhat generic. So it's easy to tell them apart.
On Thu, 2024-11-14 at 12:29 +0100, Christian Brauner wrote: > On Wed, Nov 13, 2024 at 08:45:06AM -0500, Jeff Layton wrote: > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > > > On Mon, 11 Nov 2024 10:09:54 -0500, Jeff Layton wrote: > > > > > Meta has some internal logging that scrapes /proc/self/mountinfo today. > > > > > I'd like to convert it to use listmount()/statmount(), so we can do a > > > > > better job of monitoring with containers. We're missing some fields > > > > > though. This patchset adds them. > > > > > > > > > > > > > > > > > > Applied to the vfs.misc branch of the vfs/vfs.git tree. > > > > Patches in the vfs.misc branch should appear in linux-next soon. > > > > > > Jeff, thank you for this! > > > > > > I have already implemented support for statmount() and listmount() in > > > libmount (PR: https://github.com/util-linux/util-linux/pull/3092). The > > > only remaining issue was the mount source and incomplete file system > > > type. > > > > > > > Unfortunately, I think we might be missing something else: > > > > The mountinfo (and "mounts") file generator calls show_sb_opts() which > > generates some strings from the sb->s_flags field and then calls > > security_sb_show_options(). statmount() will give you the s_flags field > > (or an equivalent), but it doesn't give you the security options > > string. So, those aren't currently visible from statmount(). > > > > How should we expose those? Should we create a new statmount string > > field and populate it, or is it better to just tack them onto the end > > of the statmount.mnt_opts string? > > I'm leaning towards using a separate field because mnt_opts/opts_array > is about filesystem specific mount options whereas the security mount > options are somewhat generic. So it's easy to tell them apart. Ordinarily, I might agree, but we're now growing a new mount option field that has them separated by NULs. Will we need two extra fields for this? One comma-separated, and one NUL separated? /proc/#/mountinfo and mounts prepend these to the output of ->show_options, so the simple solution would be to just prepend those there instead of adding a new field. FWIW, only SELinux has any extra mount options to show here. Tough call -- anyone else have opinions? -- Jeff Layton <jlayton@kernel.org>
On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote: > Ordinarily, I might agree, but we're now growing a new mount option > field that has them separated by NULs. Will we need two extra fields > for this? One comma-separated, and one NUL separated? > > /proc/#/mountinfo and mounts prepend these to the output of > ->show_options, so the simple solution would be to just prepend those > there instead of adding a new field. FWIW, only SELinux has any extra > mount options to show here. Compromise: tack them onto the end of the comma separated list, but add a new field for the nul separated security options. I think this would be logical, since the comma separated list is more useful for having a /proc/$$/mountinfo compatible string than for actually interpreting what's in there. Thanks, Miklos
On Thu, Nov 14, 2024 at 02:16:05PM +0100, Miklos Szeredi wrote: > On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote: > > > Ordinarily, I might agree, but we're now growing a new mount option > > field that has them separated by NULs. Will we need two extra fields > > for this? One comma-separated, and one NUL separated? > > > > /proc/#/mountinfo and mounts prepend these to the output of > > ->show_options, so the simple solution would be to just prepend those > > there instead of adding a new field. FWIW, only SELinux has any extra > > mount options to show here. > > Compromise: tack them onto the end of the comma separated list, but > add a new field for the nul separated security options. > > I think this would be logical, since the comma separated list is more > useful for having a /proc/$$/mountinfo compatible string than for > actually interpreting what's in there. Fair. Here's an incremental for the array of security options. diff --git a/fs/namespace.c b/fs/namespace.c index 4f39c4aba85d..a9065a9ab971 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -5072,13 +5072,30 @@ static int statmount_mnt_opts(struct kstatmount *s, struct seq_file *seq) return 0; } +static inline int statmount_opt_unescape(struct seq_file *seq, char *buf_start) +{ + char *buf_end, *opt_start, *opt_end; + int count = 0; + + buf_end = seq->buf + seq->count; + *buf_end = '\0'; + for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) { + opt_end = strchrnul(opt_start, ','); + *opt_end = '\0'; + buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1; + if (WARN_ON_ONCE(++count == INT_MAX)) + return -EOVERFLOW; + } + seq->count = buf_start - 1 - seq->buf; + return count; +} + static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq) { struct vfsmount *mnt = s->mnt; struct super_block *sb = mnt->mnt_sb; size_t start = seq->count; - char *buf_start, *buf_end, *opt_start, *opt_end; - u32 count = 0; + char *buf_start; int err; if (!sb->s_op->show_options) @@ -5095,17 +5112,39 @@ static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq) if (seq->count == start) return 0; - buf_end = seq->buf + seq->count; - *buf_end = '\0'; - for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) { - opt_end = strchrnul(opt_start, ','); - *opt_end = '\0'; - buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1; - if (WARN_ON_ONCE(++count == 0)) - return -EOVERFLOW; - } - seq->count = buf_start - 1 - seq->buf; - s->sm.opt_num = count; + err = statmount_opt_unescape(seq, buf_start); + if (err < 0) + return err; + + s->sm.opt_num = err; + return 0; +} + +static int statmount_opt_sec_array(struct kstatmount *s, struct seq_file *seq) +{ + struct vfsmount *mnt = s->mnt; + struct super_block *sb = mnt->mnt_sb; + size_t start = seq->count; + char *buf_start; + int err; + + buf_start = seq->buf + start; + + err = security_sb_show_options(seq, sb); + if (!err) + return err; + + if (unlikely(seq_has_overflowed(seq))) + return -EAGAIN; + + if (seq->count == start) + return 0; + + err = statmount_opt_unescape(seq, buf_start); + if (err < 0) + return err; + + s->sm.opt_sec_num = err; return 0; } @@ -5138,6 +5177,10 @@ static int statmount_string(struct kstatmount *s, u64 flag) sm->opt_array = start; ret = statmount_opt_array(s, seq); break; + case STATMOUNT_OPT_SEC_ARRAY: + sm->opt_sec_array = start; + ret = statmount_opt_sec_array(s, seq); + break; case STATMOUNT_FS_SUBTYPE: sm->fs_subtype = start; statmount_fs_subtype(s, seq); @@ -5294,6 +5337,9 @@ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id, if (!err && s->mask & STATMOUNT_OPT_ARRAY) err = statmount_string(s, STATMOUNT_OPT_ARRAY); + if (!err && s->mask & STATMOUNT_OPT_SEC_ARRAY) + err = statmount_string(s, STATMOUNT_OPT_SEC_ARRAY); + if (!err && s->mask & STATMOUNT_FS_SUBTYPE) err = statmount_string(s, STATMOUNT_FS_SUBTYPE); @@ -5323,7 +5369,7 @@ static inline bool retry_statmount(const long ret, size_t *seq_size) #define STATMOUNT_STRING_REQ (STATMOUNT_MNT_ROOT | STATMOUNT_MNT_POINT | \ STATMOUNT_FS_TYPE | STATMOUNT_MNT_OPTS | \ STATMOUNT_FS_SUBTYPE | STATMOUNT_SB_SOURCE | \ - STATMOUNT_OPT_ARRAY) + STATMOUNT_OPT_ARRAY | STATMOUNT_OPT_SEC_ARRAY) static int prepare_kstatmount(struct kstatmount *ks, struct mnt_id_req *kreq, struct statmount __user *buf, size_t bufsize, diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h index c0fda4604187..569d938a5757 100644 --- a/include/uapi/linux/mount.h +++ b/include/uapi/linux/mount.h @@ -177,7 +177,9 @@ struct statmount { __u32 sb_source; /* [str] Source string of the mount */ __u32 opt_num; /* Number of fs options */ __u32 opt_array; /* [str] Array of nul terminated fs options */ - __u64 __spare2[47]; + __u32 opt_sec_num; /* Number of security options */ + __u32 opt_sec_array; /* [str] Array of nul terminated security options */ + __u64 __spare2[45]; char str[]; /* Variable size part containing strings */ }; @@ -214,6 +216,7 @@ struct mnt_id_req { #define STATMOUNT_FS_SUBTYPE 0x00000100U /* Want/got fs_subtype */ #define STATMOUNT_SB_SOURCE 0x00000200U /* Want/got sb_source */ #define STATMOUNT_OPT_ARRAY 0x00000400U /* Want/got opt_... */ +#define STATMOUNT_OPT_SEC_ARRAY 0x00000800U /* Want/got opt_sec... */ /* * Special @mnt_id values that can be passed to listmount
On Thu, 2024-11-14 at 15:48 +0100, Christian Brauner wrote: > On Thu, Nov 14, 2024 at 02:16:05PM +0100, Miklos Szeredi wrote: > > On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote: > > > > > Ordinarily, I might agree, but we're now growing a new mount option > > > field that has them separated by NULs. Will we need two extra fields > > > for this? One comma-separated, and one NUL separated? > > > > > > /proc/#/mountinfo and mounts prepend these to the output of > > > ->show_options, so the simple solution would be to just prepend those > > > there instead of adding a new field. FWIW, only SELinux has any extra > > > mount options to show here. > > > > Compromise: tack them onto the end of the comma separated list, but > > add a new field for the nul separated security options. > > > > I think this would be logical, since the comma separated list is more > > useful for having a /proc/$$/mountinfo compatible string than for > > actually interpreting what's in there. > > Fair. Here's an incremental for the array of security options. > > diff --git a/fs/namespace.c b/fs/namespace.c > index 4f39c4aba85d..a9065a9ab971 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -5072,13 +5072,30 @@ static int statmount_mnt_opts(struct kstatmount *s, struct seq_file *seq) > return 0; > } > > +static inline int statmount_opt_unescape(struct seq_file *seq, char *buf_start) > +{ > + char *buf_end, *opt_start, *opt_end; > + int count = 0; > + > + buf_end = seq->buf + seq->count; > + *buf_end = '\0'; > + for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) { > + opt_end = strchrnul(opt_start, ','); > + *opt_end = '\0'; > + buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1; > + if (WARN_ON_ONCE(++count == INT_MAX)) > + return -EOVERFLOW; > + } > + seq->count = buf_start - 1 - seq->buf; > + return count; > +} > + > static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq) > { > struct vfsmount *mnt = s->mnt; > struct super_block *sb = mnt->mnt_sb; > size_t start = seq->count; > - char *buf_start, *buf_end, *opt_start, *opt_end; > - u32 count = 0; > + char *buf_start; > int err; > > if (!sb->s_op->show_options) > @@ -5095,17 +5112,39 @@ static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq) > if (seq->count == start) > return 0; > > - buf_end = seq->buf + seq->count; > - *buf_end = '\0'; > - for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) { > - opt_end = strchrnul(opt_start, ','); > - *opt_end = '\0'; > - buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1; > - if (WARN_ON_ONCE(++count == 0)) > - return -EOVERFLOW; > - } > - seq->count = buf_start - 1 - seq->buf; > - s->sm.opt_num = count; > + err = statmount_opt_unescape(seq, buf_start); > + if (err < 0) > + return err; > + > + s->sm.opt_num = err; > + return 0; > +} > + > +static int statmount_opt_sec_array(struct kstatmount *s, struct seq_file *seq) > +{ > + struct vfsmount *mnt = s->mnt; > + struct super_block *sb = mnt->mnt_sb; > + size_t start = seq->count; > + char *buf_start; > + int err; > + > + buf_start = seq->buf + start; > + > + err = security_sb_show_options(seq, sb); > + if (!err) > + return err; > + > + if (unlikely(seq_has_overflowed(seq))) > + return -EAGAIN; > + > + if (seq->count == start) > + return 0; > + > + err = statmount_opt_unescape(seq, buf_start); > + if (err < 0) > + return err; > + > + s->sm.opt_sec_num = err; > return 0; > } > > @@ -5138,6 +5177,10 @@ static int statmount_string(struct kstatmount *s, u64 flag) > sm->opt_array = start; > ret = statmount_opt_array(s, seq); > break; > + case STATMOUNT_OPT_SEC_ARRAY: > + sm->opt_sec_array = start; > + ret = statmount_opt_sec_array(s, seq); > + break; > case STATMOUNT_FS_SUBTYPE: > sm->fs_subtype = start; > statmount_fs_subtype(s, seq); > @@ -5294,6 +5337,9 @@ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id, > if (!err && s->mask & STATMOUNT_OPT_ARRAY) > err = statmount_string(s, STATMOUNT_OPT_ARRAY); > > + if (!err && s->mask & STATMOUNT_OPT_SEC_ARRAY) > + err = statmount_string(s, STATMOUNT_OPT_SEC_ARRAY); > + > if (!err && s->mask & STATMOUNT_FS_SUBTYPE) > err = statmount_string(s, STATMOUNT_FS_SUBTYPE); > > @@ -5323,7 +5369,7 @@ static inline bool retry_statmount(const long ret, size_t *seq_size) > #define STATMOUNT_STRING_REQ (STATMOUNT_MNT_ROOT | STATMOUNT_MNT_POINT | \ > STATMOUNT_FS_TYPE | STATMOUNT_MNT_OPTS | \ > STATMOUNT_FS_SUBTYPE | STATMOUNT_SB_SOURCE | \ > - STATMOUNT_OPT_ARRAY) > + STATMOUNT_OPT_ARRAY | STATMOUNT_OPT_SEC_ARRAY) > > static int prepare_kstatmount(struct kstatmount *ks, struct mnt_id_req *kreq, > struct statmount __user *buf, size_t bufsize, > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h > index c0fda4604187..569d938a5757 100644 > --- a/include/uapi/linux/mount.h > +++ b/include/uapi/linux/mount.h > @@ -177,7 +177,9 @@ struct statmount { > __u32 sb_source; /* [str] Source string of the mount */ > __u32 opt_num; /* Number of fs options */ > __u32 opt_array; /* [str] Array of nul terminated fs options */ > - __u64 __spare2[47]; > + __u32 opt_sec_num; /* Number of security options */ > + __u32 opt_sec_array; /* [str] Array of nul terminated security options */ > + __u64 __spare2[45]; shouldn't that be 46 ? > char str[]; /* Variable size part containing strings */ > }; > > @@ -214,6 +216,7 @@ struct mnt_id_req { > #define STATMOUNT_FS_SUBTYPE 0x00000100U /* Want/got fs_subtype */ > #define STATMOUNT_SB_SOURCE 0x00000200U /* Want/got sb_source */ > #define STATMOUNT_OPT_ARRAY 0x00000400U /* Want/got opt_... */ > +#define STATMOUNT_OPT_SEC_ARRAY 0x00000800U /* Want/got opt_sec... */ > > /* > * Special @mnt_id values that can be passed to listmount The rest looks good to me though! -- Jeff Layton <jlayton@kernel.org>
On Thu, Nov 14, 2024 at 09:51:42AM -0500, Jeff Layton wrote: > On Thu, 2024-11-14 at 15:48 +0100, Christian Brauner wrote: > > On Thu, Nov 14, 2024 at 02:16:05PM +0100, Miklos Szeredi wrote: > > > On Thu, 14 Nov 2024 at 13:29, Jeff Layton <jlayton@kernel.org> wrote: > > > > > > > Ordinarily, I might agree, but we're now growing a new mount option > > > > field that has them separated by NULs. Will we need two extra fields > > > > for this? One comma-separated, and one NUL separated? > > > > > > > > /proc/#/mountinfo and mounts prepend these to the output of > > > > ->show_options, so the simple solution would be to just prepend those > > > > there instead of adding a new field. FWIW, only SELinux has any extra > > > > mount options to show here. > > > > > > Compromise: tack them onto the end of the comma separated list, but > > > add a new field for the nul separated security options. > > > > > > I think this would be logical, since the comma separated list is more > > > useful for having a /proc/$$/mountinfo compatible string than for > > > actually interpreting what's in there. > > > > Fair. Here's an incremental for the array of security options. > > > > diff --git a/fs/namespace.c b/fs/namespace.c > > index 4f39c4aba85d..a9065a9ab971 100644 > > --- a/fs/namespace.c > > +++ b/fs/namespace.c > > @@ -5072,13 +5072,30 @@ static int statmount_mnt_opts(struct kstatmount *s, struct seq_file *seq) > > return 0; > > } > > > > +static inline int statmount_opt_unescape(struct seq_file *seq, char *buf_start) > > +{ > > + char *buf_end, *opt_start, *opt_end; > > + int count = 0; > > + > > + buf_end = seq->buf + seq->count; > > + *buf_end = '\0'; > > + for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) { > > + opt_end = strchrnul(opt_start, ','); > > + *opt_end = '\0'; > > + buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1; > > + if (WARN_ON_ONCE(++count == INT_MAX)) > > + return -EOVERFLOW; > > + } > > + seq->count = buf_start - 1 - seq->buf; > > + return count; > > +} > > + > > static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq) > > { > > struct vfsmount *mnt = s->mnt; > > struct super_block *sb = mnt->mnt_sb; > > size_t start = seq->count; > > - char *buf_start, *buf_end, *opt_start, *opt_end; > > - u32 count = 0; > > + char *buf_start; > > int err; > > > > if (!sb->s_op->show_options) > > @@ -5095,17 +5112,39 @@ static int statmount_opt_array(struct kstatmount *s, struct seq_file *seq) > > if (seq->count == start) > > return 0; > > > > - buf_end = seq->buf + seq->count; > > - *buf_end = '\0'; > > - for (opt_start = buf_start + 1; opt_start < buf_end; opt_start = opt_end + 1) { > > - opt_end = strchrnul(opt_start, ','); > > - *opt_end = '\0'; > > - buf_start += string_unescape(opt_start, buf_start, 0, UNESCAPE_OCTAL) + 1; > > - if (WARN_ON_ONCE(++count == 0)) > > - return -EOVERFLOW; > > - } > > - seq->count = buf_start - 1 - seq->buf; > > - s->sm.opt_num = count; > > + err = statmount_opt_unescape(seq, buf_start); > > + if (err < 0) > > + return err; > > + > > + s->sm.opt_num = err; > > + return 0; > > +} > > + > > +static int statmount_opt_sec_array(struct kstatmount *s, struct seq_file *seq) > > +{ > > + struct vfsmount *mnt = s->mnt; > > + struct super_block *sb = mnt->mnt_sb; > > + size_t start = seq->count; > > + char *buf_start; > > + int err; > > + > > + buf_start = seq->buf + start; > > + > > + err = security_sb_show_options(seq, sb); > > + if (!err) > > + return err; > > + > > + if (unlikely(seq_has_overflowed(seq))) > > + return -EAGAIN; > > + > > + if (seq->count == start) > > + return 0; > > + > > + err = statmount_opt_unescape(seq, buf_start); > > + if (err < 0) > > + return err; > > + > > + s->sm.opt_sec_num = err; > > return 0; > > } > > > > @@ -5138,6 +5177,10 @@ static int statmount_string(struct kstatmount *s, u64 flag) > > sm->opt_array = start; > > ret = statmount_opt_array(s, seq); > > break; > > + case STATMOUNT_OPT_SEC_ARRAY: > > + sm->opt_sec_array = start; > > + ret = statmount_opt_sec_array(s, seq); > > + break; > > case STATMOUNT_FS_SUBTYPE: > > sm->fs_subtype = start; > > statmount_fs_subtype(s, seq); > > @@ -5294,6 +5337,9 @@ static int do_statmount(struct kstatmount *s, u64 mnt_id, u64 mnt_ns_id, > > if (!err && s->mask & STATMOUNT_OPT_ARRAY) > > err = statmount_string(s, STATMOUNT_OPT_ARRAY); > > > > + if (!err && s->mask & STATMOUNT_OPT_SEC_ARRAY) > > + err = statmount_string(s, STATMOUNT_OPT_SEC_ARRAY); > > + > > if (!err && s->mask & STATMOUNT_FS_SUBTYPE) > > err = statmount_string(s, STATMOUNT_FS_SUBTYPE); > > > > @@ -5323,7 +5369,7 @@ static inline bool retry_statmount(const long ret, size_t *seq_size) > > #define STATMOUNT_STRING_REQ (STATMOUNT_MNT_ROOT | STATMOUNT_MNT_POINT | \ > > STATMOUNT_FS_TYPE | STATMOUNT_MNT_OPTS | \ > > STATMOUNT_FS_SUBTYPE | STATMOUNT_SB_SOURCE | \ > > - STATMOUNT_OPT_ARRAY) > > + STATMOUNT_OPT_ARRAY | STATMOUNT_OPT_SEC_ARRAY) > > > > static int prepare_kstatmount(struct kstatmount *ks, struct mnt_id_req *kreq, > > struct statmount __user *buf, size_t bufsize, > > diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h > > index c0fda4604187..569d938a5757 100644 > > --- a/include/uapi/linux/mount.h > > +++ b/include/uapi/linux/mount.h > > @@ -177,7 +177,9 @@ struct statmount { > > __u32 sb_source; /* [str] Source string of the mount */ > > __u32 opt_num; /* Number of fs options */ > > __u32 opt_array; /* [str] Array of nul terminated fs options */ > > - __u64 __spare2[47]; > > + __u32 opt_sec_num; /* Number of security options */ > > + __u32 opt_sec_array; /* [str] Array of nul terminated security options */ > > + __u64 __spare2[45]; > > shouldn't that be 46 ? Yes, apparently I can't count. Thanks for noticing! > > > char str[]; /* Variable size part containing strings */ > > }; > > > > @@ -214,6 +216,7 @@ struct mnt_id_req { > > #define STATMOUNT_FS_SUBTYPE 0x00000100U /* Want/got fs_subtype */ > > #define STATMOUNT_SB_SOURCE 0x00000200U /* Want/got sb_source */ > > #define STATMOUNT_OPT_ARRAY 0x00000400U /* Want/got opt_... */ > > +#define STATMOUNT_OPT_SEC_ARRAY 0x00000800U /* Want/got opt_sec... */ > > > > /* > > * Special @mnt_id values that can be passed to listmount > > The rest looks good to me though! > -- > Jeff Layton <jlayton@kernel.org>
On Wed 13-11-24 08:45:06, Jeff Layton wrote: > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > Next on the wish list is a notification (a file descriptor that can be > > used in epoll) that returns a 64-bit ID when there is a change in the > > mount node. This will enable us to enhance systemd so that it does not > > have to read the entire mount table after every change. > > > > New fanotify events for mount table changes, perhaps? Now that I'm looking at it I'm not sure fanotify is a great fit for this usecase. A lot of fanotify functionality does not really work for virtual filesystems such as proc and hence we generally try to discourage use of fanotify for them. So just supporting one type of event (like FAN_MODIFY) on one file inside proc looks as rather inconsistent interface. But I vaguely remember we were discussing some kind of mount event, weren't we? Or was that for something else? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On 13/11/24 23:18, Jan Kara wrote: > On Wed 13-11-24 08:45:06, Jeff Layton wrote: >> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: >>> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: >>> Next on the wish list is a notification (a file descriptor that can be >>> used in epoll) that returns a 64-bit ID when there is a change in the >>> mount node. This will enable us to enhance systemd so that it does not >>> have to read the entire mount table after every change. >>> >> New fanotify events for mount table changes, perhaps? > Now that I'm looking at it I'm not sure fanotify is a great fit for this > usecase. A lot of fanotify functionality does not really work for virtual > filesystems such as proc and hence we generally try to discourage use of > fanotify for them. So just supporting one type of event (like FAN_MODIFY) > on one file inside proc looks as rather inconsistent interface. But I > vaguely remember we were discussing some kind of mount event, weren't we? > Or was that for something else? Well, yes, the idea was to be able to avoid the overhead of scanning the proc mount table(s) pretty much entirely, particularly bad for rapid fire event handling such as a largish number of rapid mounta or umounts. Ian
On 13/11/24 23:18, Jan Kara wrote: > On Wed 13-11-24 08:45:06, Jeff Layton wrote: >> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: >>> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: >>> Next on the wish list is a notification (a file descriptor that can be >>> used in epoll) that returns a 64-bit ID when there is a change in the >>> mount node. This will enable us to enhance systemd so that it does not >>> have to read the entire mount table after every change. >>> >> New fanotify events for mount table changes, perhaps? > Now that I'm looking at it I'm not sure fanotify is a great fit for this > usecase. A lot of fanotify functionality does not really work for virtual > filesystems such as proc and hence we generally try to discourage use of > fanotify for them. So just supporting one type of event (like FAN_MODIFY) > on one file inside proc looks as rather inconsistent interface. But I > vaguely remember we were discussing some kind of mount event, weren't we? > Or was that for something else? I still need to have a look at the existing notifications sub-systems but, tbh, I also don't think they offer the needed functionality. The thing that was most useful with David's notifications when I was trying to improve the mounts handling was the queuing interface. It allowed me to batch notifications up to around a couple of hundred and grab them in one go for processing. This significantly lowered the overhead of rapid fire event processing. The ability to go directly to an individual mount and get it's information only got about half the improvement I saw, the rest come from the notifications improvement. Ian
On Thu 14-11-24 09:45:23, Ian Kent wrote: > On 13/11/24 23:18, Jan Kara wrote: > > On Wed 13-11-24 08:45:06, Jeff Layton wrote: > > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > > > Next on the wish list is a notification (a file descriptor that can be > > > > used in epoll) that returns a 64-bit ID when there is a change in the > > > > mount node. This will enable us to enhance systemd so that it does not > > > > have to read the entire mount table after every change. > > > > > > > New fanotify events for mount table changes, perhaps? > > Now that I'm looking at it I'm not sure fanotify is a great fit for this > > usecase. A lot of fanotify functionality does not really work for virtual > > filesystems such as proc and hence we generally try to discourage use of > > fanotify for them. So just supporting one type of event (like FAN_MODIFY) > > on one file inside proc looks as rather inconsistent interface. But I > > vaguely remember we were discussing some kind of mount event, weren't we? > > Or was that for something else? > > I still need to have a look at the existing notifications sub-systems but, > tbh, I also don't think they offer the needed functionality. > > The thing that was most useful with David's notifications when I was trying > to improve the mounts handling was the queuing interface. It allowed me to > batch notifications up to around a couple of hundred and grab them in one go > for processing. This significantly lowered the overhead of rapid fire event > processing. The ability to go directly to an individual mount and get it's > information only got about half the improvement I saw, the rest come from > the notifications improvement. Well, if we implemented the mount notification events in fanotify, then the mount events get queued in the notification group queue and you can process the whole batch of events in one go if you want. So I don't see batching as an issue. What I'm more worried about is that watching the whole system for new mounts is going to be somewhat cumbersome when all you can do is to watch new mounts attached under an existing mount / filesystem. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On 14/11/24 19:56, Jan Kara wrote: > On Thu 14-11-24 09:45:23, Ian Kent wrote: >> On 13/11/24 23:18, Jan Kara wrote: >>> On Wed 13-11-24 08:45:06, Jeff Layton wrote: >>>> On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: >>>>> On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: >>>>> Next on the wish list is a notification (a file descriptor that can be >>>>> used in epoll) that returns a 64-bit ID when there is a change in the >>>>> mount node. This will enable us to enhance systemd so that it does not >>>>> have to read the entire mount table after every change. >>>>> >>>> New fanotify events for mount table changes, perhaps? >>> Now that I'm looking at it I'm not sure fanotify is a great fit for this >>> usecase. A lot of fanotify functionality does not really work for virtual >>> filesystems such as proc and hence we generally try to discourage use of >>> fanotify for them. So just supporting one type of event (like FAN_MODIFY) >>> on one file inside proc looks as rather inconsistent interface. But I >>> vaguely remember we were discussing some kind of mount event, weren't we? >>> Or was that for something else? >> I still need to have a look at the existing notifications sub-systems but, >> tbh, I also don't think they offer the needed functionality. >> >> The thing that was most useful with David's notifications when I was trying >> to improve the mounts handling was the queuing interface. It allowed me to >> batch notifications up to around a couple of hundred and grab them in one go >> for processing. This significantly lowered the overhead of rapid fire event >> processing. The ability to go directly to an individual mount and get it's >> information only got about half the improvement I saw, the rest come from >> the notifications improvement. > Well, if we implemented the mount notification events in fanotify, then the > mount events get queued in the notification group queue and you can process > the whole batch of events in one go if you want. So I don't see batching as > an issue. What I'm more worried about is that watching the whole system > for new mounts is going to be somewhat cumbersome when all you can do is to > watch new mounts attached under an existing mount / filesystem. But, for mounts/unounts for example, isn't it the act of performing the mount/unmount that triggers the notification if the path in within a file system that's marked to report such events? Ian
On Mon 18-11-24 07:29:42, Ian Kent wrote: > On 14/11/24 19:56, Jan Kara wrote: > > On Thu 14-11-24 09:45:23, Ian Kent wrote: > > > On 13/11/24 23:18, Jan Kara wrote: > > > > On Wed 13-11-24 08:45:06, Jeff Layton wrote: > > > > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > > > > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > > > > > Next on the wish list is a notification (a file descriptor that can be > > > > > > used in epoll) that returns a 64-bit ID when there is a change in the > > > > > > mount node. This will enable us to enhance systemd so that it does not > > > > > > have to read the entire mount table after every change. > > > > > > > > > > > New fanotify events for mount table changes, perhaps? > > > > Now that I'm looking at it I'm not sure fanotify is a great fit for this > > > > usecase. A lot of fanotify functionality does not really work for virtual > > > > filesystems such as proc and hence we generally try to discourage use of > > > > fanotify for them. So just supporting one type of event (like FAN_MODIFY) > > > > on one file inside proc looks as rather inconsistent interface. But I > > > > vaguely remember we were discussing some kind of mount event, weren't we? > > > > Or was that for something else? > > > I still need to have a look at the existing notifications sub-systems but, > > > tbh, I also don't think they offer the needed functionality. > > > > > > The thing that was most useful with David's notifications when I was trying > > > to improve the mounts handling was the queuing interface. It allowed me to > > > batch notifications up to around a couple of hundred and grab them in one go > > > for processing. This significantly lowered the overhead of rapid fire event > > > processing. The ability to go directly to an individual mount and get it's > > > information only got about half the improvement I saw, the rest come from > > > the notifications improvement. > > Well, if we implemented the mount notification events in fanotify, then the > > mount events get queued in the notification group queue and you can process > > the whole batch of events in one go if you want. So I don't see batching as > > an issue. What I'm more worried about is that watching the whole system > > for new mounts is going to be somewhat cumbersome when all you can do is to > > watch new mounts attached under an existing mount / filesystem. > > But, for mounts/unounts for example, isn't it the act of performing the > mount/unmount that triggers the notification if the path in within a file > system that's marked to report such events? Obviously it is the act of mounting / unmounting that will trigger the generation of the event. But I guess I don't understand what are you getting at... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Thu, 14 Nov 2024 at 12:56, Jan Kara <jack@suse.cz> wrote: > What I'm more worried about is that watching the whole system > for new mounts is going to be somewhat cumbersome when all you can do is to > watch new mounts attached under an existing mount / filesystem. We don't even know if there's a use case for that. I think it would make sense to think about it when/if such a use case emerges. The inode notification interfaces went through that evolution too, no? Thanks, Miklos
On Wed, 13 Nov 2024 at 16:18, Jan Kara <jack@suse.cz> wrote: > > On Wed 13-11-24 08:45:06, Jeff Layton wrote: > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > > Next on the wish list is a notification (a file descriptor that can be > > > used in epoll) that returns a 64-bit ID when there is a change in the > > > mount node. This will enable us to enhance systemd so that it does not > > > have to read the entire mount table after every change. > > > > > > > New fanotify events for mount table changes, perhaps? > > Now that I'm looking at it I'm not sure fanotify is a great fit for this > usecase. A lot of fanotify functionality does not really work for virtual > filesystems such as proc and hence we generally try to discourage use of > fanotify for them. So just supporting one type of event (like FAN_MODIFY) > on one file inside proc looks as rather inconsistent interface. But I > vaguely remember we were discussing some kind of mount event, weren't we? > Or was that for something else? Yeah, if memory serves right what we agreed on was that placing a watch on a mount would result in events being generated for mount/umount/move_mount directly under that mount. So this would not be monitoring the entire namespace as poll on /proc/$$/mountinfo does. IIRC Lennart said that this is okay and even desirable for systemd, since it's only interested in a particular set of mounts. Thanks, Miklos
On Wed 13-11-24 16:49:52, Miklos Szeredi wrote: > On Wed, 13 Nov 2024 at 16:18, Jan Kara <jack@suse.cz> wrote: > > > > On Wed 13-11-24 08:45:06, Jeff Layton wrote: > > > On Wed, 2024-11-13 at 12:27 +0100, Karel Zak wrote: > > > > On Tue, Nov 12, 2024 at 02:39:21PM GMT, Christian Brauner wrote: > > > > Next on the wish list is a notification (a file descriptor that can be > > > > used in epoll) that returns a 64-bit ID when there is a change in the > > > > mount node. This will enable us to enhance systemd so that it does not > > > > have to read the entire mount table after every change. > > > > > > > > > > New fanotify events for mount table changes, perhaps? > > > > Now that I'm looking at it I'm not sure fanotify is a great fit for this > > usecase. A lot of fanotify functionality does not really work for virtual > > filesystems such as proc and hence we generally try to discourage use of > > fanotify for them. So just supporting one type of event (like FAN_MODIFY) > > on one file inside proc looks as rather inconsistent interface. But I > > vaguely remember we were discussing some kind of mount event, weren't we? > > Or was that for something else? > > Yeah, if memory serves right what we agreed on was that placing a > watch on a mount would result in events being generated for > mount/umount/move_mount directly under that mount. So this would not > be monitoring the entire namespace as poll on /proc/$$/mountinfo does. > IIRC Lennart said that this is okay and even desirable for systemd, > since it's only interested in a particular set of mounts. Oh, right. Thanks for reminding me. And yes, this would fit within what fanotify supports quite nicely. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR
On Wed, 13 Nov 2024 at 12:28, Karel Zak <kzak@redhat.com> wrote: > Next on the wish list is a notification (a file descriptor that can be > used in epoll) that returns a 64-bit ID when there is a change in the > mount node. This will enable us to enhance systemd so that it does not > have to read the entire mount table after every change. IIRC Amir said he'd take this up (this was at LSFMM 2023). I'm also happy to take part in the design and implementation. Thanks, Miklos
On Mon, 11 Nov 2024 at 16:10, Jeff Layton <jlayton@kernel.org> wrote: > > Meta has some internal logging that scrapes /proc/self/mountinfo today. > I'd like to convert it to use listmount()/statmount(), so we can do a > better job of monitoring with containers. We're missing some fields > though. This patchset adds them. > > Signed-off-by: Jeff Layton <jlayton@kernel.org> When thinking of the naming for the unescaped options, I realized that maybe "source" is better than "sb_soure" which just adds redundant info. Just a thought, don't bother if you don't agree. Acked-by: Miklos Szeredi <mszeredi@redhat.com> Thanks, Miklos
On Tue, 2024-11-12 at 11:32 +0100, Miklos Szeredi wrote: > On Mon, 11 Nov 2024 at 16:10, Jeff Layton <jlayton@kernel.org> wrote: > > > > Meta has some internal logging that scrapes /proc/self/mountinfo today. > > I'd like to convert it to use listmount()/statmount(), so we can do a > > better job of monitoring with containers. We're missing some fields > > though. This patchset adds them. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > When thinking of the naming for the unescaped options, I realized that > maybe "source" is better than "sb_soure" which just adds redundant > info. Just a thought, don't bother if you don't agree. > > Acked-by: Miklos Szeredi <mszeredi@redhat.com> > I think sb_source fits better with the convention that the other fields in the struct use. I say we stick with that. -- Jeff Layton <jlayton@kernel.org>
© 2016 - 2024 Red Hat, Inc.