[PATCH] fs/kernfs: implement STATX_BTIME

Max Kellermann posted 1 patch 9 months, 1 week ago
[PATCH] fs/kernfs: implement STATX_BTIME
Posted by Max Kellermann 9 months, 1 week ago
This allows finding out when an inode was initially created, for
example:

- when was a device plugged in (and its node in sysfs was created)?
- when was a cgroup created?

kernfs currently only implements `atime`, `mtime` and `ctime`.  All of
these are volatile (`mtime` and `ctime` get updated automatically, and
`atime` can be manipulated using utime()).  Therefore, I suggest
implementing STATX_BTIME to have a reliable birth time in userspace.

Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/kernfs/dir.c        | 2 ++
 fs/kernfs/inode.c      | 6 ++++++
 include/linux/kernfs.h | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index fc70d72c3fe8..9a6857f2f3d7 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -678,6 +678,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
 			goto err_out3;
 	}
 
+	ktime_get_real_ts64(&kn->btime);
+
 	return kn;
 
  err_out3:
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index b83054da68b3..1ff2ee62bfe6 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -189,6 +189,12 @@ int kernfs_iop_getattr(struct mnt_idmap *idmap,
 	struct kernfs_root *root = kernfs_root(kn);
 
 	down_read(&root->kernfs_iattr_rwsem);
+
+	if (request_mask & STATX_BTIME) {
+		stat->result_mask |= STATX_BTIME;
+		stat->btime = kn->btime;
+	}
+
 	kernfs_refresh_inode(kn, inode);
 	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
 	up_read(&root->kernfs_iattr_rwsem);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index b5a5f32fdfd1..9332aadf4b48 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -229,6 +229,13 @@ struct kernfs_node {
 	void			*priv;
 	struct kernfs_iattrs	*iattr;
 
+	/*
+	 * The birth time (for STATX_BTIME).  It lives here and not in
+	 * struct kernfs_iattrs because the latter is only created on
+	 * demand, not at actual node birth time.
+	 */
+	struct timespec64	btime;
+
 	struct rcu_head		rcu;
 };
 
-- 
2.47.2
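
A minimal userspace sketch of how the birth time exposed by this patch
could be queried, assuming a kernel with the patch applied and glibc's
statx() wrapper; the path below is only an example:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/sys/fs/cgroup/mygroup";
	struct statx stx;

	/* ask only for the birth time */
	if (statx(AT_FDCWD, path, 0, STATX_BTIME, &stx) < 0) {
		perror("statx");
		return 1;
	}

	/* the kernel reports which fields it actually filled in */
	if (stx.stx_mask & STATX_BTIME)
		printf("born at %lld.%09u\n",
		       (long long)stx.stx_btime.tv_sec,
		       stx.stx_btime.tv_nsec);
	else
		printf("btime not supported on this filesystem\n");

	return 0;
}
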
Re: [PATCH] fs/kernfs: implement STATX_BTIME
Posted by Greg KH 9 months, 1 week ago
On Tue, May 06, 2025 at 06:40:17PM +0200, Max Kellermann wrote:
> This allows finding out when an inode was initially created, for
> example:
> 
> - when was a device plugged in (and its node in sysfs was created)?
> - when was a cgroup created?
> 
> kernfs currently only implements `atime`, `mtime` and `ctime`.  All of
> these are volatile (`mtime` and `ctime` get updated automatically, and
> `atime` can be manipulated using utime()).  Therefore, I suggest
> implementing STATX_BTIME to have a reliable birth time in userspace.
> 
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
>  fs/kernfs/dir.c        | 2 ++
>  fs/kernfs/inode.c      | 6 ++++++
>  include/linux/kernfs.h | 7 +++++++
>  3 files changed, 15 insertions(+)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index fc70d72c3fe8..9a6857f2f3d7 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -678,6 +678,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
>  			goto err_out3;
>  	}
>  
> +	ktime_get_real_ts64(&kn->btime);
> +
>  	return kn;
>  
>   err_out3:
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index b83054da68b3..1ff2ee62bfe6 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -189,6 +189,12 @@ int kernfs_iop_getattr(struct mnt_idmap *idmap,
>  	struct kernfs_root *root = kernfs_root(kn);
>  
>  	down_read(&root->kernfs_iattr_rwsem);
> +
> +	if (request_mask & STATX_BTIME) {
> +		stat->result_mask |= STATX_BTIME;
> +		stat->btime = kn->btime;
> +	}
> +
>  	kernfs_refresh_inode(kn, inode);
>  	generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
>  	up_read(&root->kernfs_iattr_rwsem);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index b5a5f32fdfd1..9332aadf4b48 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -229,6 +229,13 @@ struct kernfs_node {
>  	void			*priv;
>  	struct kernfs_iattrs	*iattr;
>  
> +	/*
> +	 * The birth time (for STATX_BTIME).  It lives here and not in
> +	 * struct kernfs_iattrs because the latter is only created on
> +	 * demand, not at actual node birth time.
> +	 */
> +	struct timespec64	btime;

You did just make this structure bigger, which has a real effect on many
systems (think 32bit s390 systems with 30k disks.)  Are you sure this is
really needed?

What userspace tools want this such that they cannot determine it any
other way?  What do they want this information for?  What is going to
depend on and require this to warrant it being added like this?

I'm loath to increase the size of this structure just for "it would be
nice" type of things.  We need to see a real user and a real use case
for this please.

And knowing when a device shows up in the system isn't that, sorry, the
kernel log shows that for you already, right?

thanks,

greg k-h
Re: [PATCH] fs/kernfs: implement STATX_BTIME
Posted by Max Kellermann 9 months, 1 week ago
On Tue, May 6, 2025 at 7:44 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> You did just make this structure bigger, which has a real effect on many
> systems (think 32bit s390 systems with 30k disks.)  Are you sure this is
> really needed?

No, it's not "needed", and I understand that you question any increase
in struct sizes.

If you really worry about memory usage, kernfs has a lot of potential
for optimizations. One simple example: embed struct kernfs_node into
struct kobject/cgroup_file instead of having a pointer. That would
save 8 bytes plus the overhead for a memory allocation (i.e. that
alone would save more than my patch adds). Would you accept such an
optimization, if I were to submit a patch?
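
A toy user-space illustration of that embedding idea (these are not the
real struct kobject / struct kernfs_node; names and sizes are made up):
embedding drops the 8-byte pointer and the separate allocation with its
per-object allocator overhead.

#include <stdio.h>

struct node { char payload[128]; };	/* stand-in for the child object */

struct owner_with_pointer {
	struct node *n;		/* 8 bytes, plus a second allocation for *n */
};

struct owner_embedded {
	struct node n;		/* one allocation covers both objects */
};

int main(void)
{
	printf("pointer variant:  %zu-byte owner + %zu-byte separate node\n",
	       sizeof(struct owner_with_pointer), sizeof(struct node));
	printf("embedded variant: %zu-byte owner, no extra allocation\n",
	       sizeof(struct owner_embedded));
	return 0;
}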

> What userspace tools want this such that they cannot determine it any
> other way?  What do they want this information for?  What is going to
> depend on and require this to warrant it being added like this?

We have a process that reaps empty cgroups and reads resource usage
statistics (https://github.com/CM4all/spawn and
https://cm4all-spawn.readthedocs.io/en/latest/#resource-accounting),
and this process would like to know when that cgroup was created. That
means we can measure the lifetime duration of the cgroup, and for
example we can estimate the average CPU usage over the whole lifetime.
Using the cgroup's btime is the natural canonical way to determine
that time stamp, but cgroupfs doesn't implement it.
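
A rough sketch of that computation, assuming cgroup v2 mounted at
/sys/fs/cgroup, the glibc statx() wrapper, and an example cgroup name:
it divides cpu.stat's usage_usec by the time elapsed since the cgroup's
btime.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int main(void)
{
	const char *cg = "/sys/fs/cgroup/mygroup";	/* example only */
	char path[256], line[128];
	struct statx stx;
	struct timespec now;
	unsigned long long usage_usec = 0;
	FILE *f;

	/* birth time of the cgroup directory (needs this patch) */
	if (statx(AT_FDCWD, cg, 0, STATX_BTIME, &stx) < 0 ||
	    !(stx.stx_mask & STATX_BTIME)) {
		fprintf(stderr, "no btime for %s\n", cg);
		return 1;
	}

	/* accumulated CPU time from cgroup v2's cpu.stat */
	snprintf(path, sizeof(path), "%s/cpu.stat", cg);
	f = fopen(path, "r");
	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "usage_usec %llu", &usage_usec) == 1)
			break;
	fclose(f);

	clock_gettime(CLOCK_REALTIME, &now);
	double lifetime = difftime(now.tv_sec, stx.stx_btime.tv_sec);
	if (lifetime <= 0)
		lifetime = 1;

	/* average share of one CPU over the cgroup's whole lifetime */
	printf("lifetime: %.0f s, avg CPU: %.2f%%\n",
	       lifetime, usage_usec / 1e4 / lifetime);
	return 0;
}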

Sure, our container manager could store the birth time in an xattr ...
but that feels like a lousy kludge. If we have a concept of btime, I
should use that.

(It's okay if you don't like the patch and it doesn't get merged - as
always, I can happily keep it in our private kernel fork. I'm only
offering my work to everybody, because I'm a strong believer in open
source.)

> And knowing when a device shows up in the system isn't that, sorry, the
> kernel log shows that for you already, right?

That was a theoretical example that's of no interest for me currently
(just a side effect of my patch), but it might be interesting for
others.
But are you suggesting that programs should read and parse the kernel
log for that? I don't think any program should ever do that.

Max
Re: [PATCH] fs/kernfs: implement STATX_BTIME
Posted by Greg KH 8 months, 3 weeks ago
On Tue, May 06, 2025 at 08:30:26PM +0200, Max Kellermann wrote:
> On Tue, May 6, 2025 at 7:44 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> > You did just make this structure bigger, which has a real effect on many
> > systems (think 32bit s390 systems with 30k disks.)  Are you sure this is
> > really needed?
> 
> No, it's not "needed", and I understand that you question any increase
> in struct sizes.
> 
> If you really worry about memory usage, kernfs has a lot of potential
> for optimizations. One simple example: embed struct kernfs_node into
> struct kobject/cgroup_file instead of having a pointer. That would
> save 8 bytes plus the overhead for a memory allocation (i.e. that
> alone would save more than my patch adds). Would you accept such an
> optimization, if I were to submit a patch?

Sure, if that saves memory that would be a good change.

> > What userspace tools want this such that they cannot determine it any
> > other way?  What do they want this information for?  What is going to
> > depend on and require this to warrant it being added like this?
> 
> We have a process that reaps empty cgroups and reads resource usage
> statistics (https://github.com/CM4all/spawn and
> https://cm4all-spawn.readthedocs.io/en/latest/#resource-accounting),
> and this process would like to know when that cgroup was created. That
> means we can measure the lifetime duration of the cgroup, and for
> example we can estimate the average CPU usage over the whole lifetime.
> Using the cgroup's btime is the natural canonical way to determine
> that time stamp, but cgroupfs doesn't implement it.
> 
> Sure, our container manager could store the birth time in an xattr ...
> but that feels like a lousy kludge. If we have a concept of btime, I
> should use that.
> 
> (It's okay if you don't like the patch and it doesn't get merged - as
> always, I can happily keep it in our private kernel fork. I'm only
> offering my work to everybody, because I'm a strong believer in open
> source.)

I feel like the creation time here is odd, but if others really
need/want it we could take it.  I would like to see others review it
though.

> > And knowing when a device shows up in the system isn't that, sorry, the
> > kernel log shows that for you already, right?
> 
> That was a theoretical example that's of no interest for me currently
> (just a side effect of my patch), but it might be interesting for
> others.
> But are you suggesting that programs should read and parse the kernel
> log for that? I don't think any program should ever do that.

Many programs read the kernel log for various reasons; that's why we
have some common macros/functions for displaying data properly there,
like the dev_err() call.

thanks,

greg k-h
Re: [PATCH] fs/kernfs: implement STATX_BTIME
Posted by Tejun Heo 8 months, 3 weeks ago
On Wed, May 21, 2025 at 01:24:36PM +0200, Greg KH wrote:
...
> > (It's okay if you don't like the patch and it doesn't get merged - as
> > always, I can happily keep it in our private kernel fork. I'm only
> > offering my work to everybody, because I'm a strong believer in open
> > source.)
> 
> I feel like the creation time here is odd, but if others really
> need/want it we could take it.  I would like to see others review it
> though.

Isn't this trivially trackable from userspace with udev?

Thanks.

-- 
tejun
Re: [PATCH] fs/kernfs: implement STATX_BTIME
Posted by Max Kellermann 8 months, 3 weeks ago
On Wed, May 21, 2025 at 6:18 PM Tejun Heo <tj@kernel.org> wrote:
> Isn't this trivially trackable from userspace with udev?

udev doesn't track the birth time of arbitrary cgroups, and I wouldn't
call udev "trivial". How many lines of C code would you need to talk
to udev to find something out, and how many CPU cycles does that take?
One statx() is trivial and cheap, and the btime field is the dedicated
API meant for exactly this. (I'm going to use it, even if the rest of
the world doesn't.)