[v4] fanotify accounting for fs/splice.c

[PATCH v4 0/3] fanotify accounting for fs/splice.c

Posted by Ahelenia Ziemiańska 2 years, 7 months ago

Always generate modify out, access in for splice;
this gets automatically merged with no ugly special cases.

No changes to 2/3 or 3/3.

Ahelenia Ziemiańska (3):
  splice: always fsnotify_access(in), fsnotify_modify(out) on success
  splice: fsnotify_access(fd)/fsnotify_modify(fd) in vmsplice
  splice: fsnotify_access(in), fsnotify_modify(out) on success in tee

 fs/splice.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

Interdiff against v3:
diff --git a/fs/splice.c b/fs/splice.c
index 2ecfccbda956..bdbabc2ebfff 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1184,10 +1184,6 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 			out->f_pos = offset;
 		else
 			*off_out = offset;
-
-		// splice_write-> already marked out
-		// as modified via vfs_iter_write()
-		goto noaccessout;
 	} else if (opipe) {
 		if (off_out)
 			return -ESPIPE;
@@ -1211,11 +1207,10 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 	} else
 		return -EINVAL;
 
-	if (ret > 0)
+	if (ret > 0) {
 		fsnotify_modify(out);
-noaccessout:
-	if (ret > 0)
 		fsnotify_access(in);
+	}
 
 	return ret;
 }
-- 
2.39.2

Re: [PATCH v4 0/3] fanotify accounting for fs/splice.c

Posted by Jan Kara 2 years, 7 months ago

Hello!

On Tue 27-06-23 22:50:46, Ahelenia Ziemiańska wrote:
> Always generate modify out, access in for splice;
> this gets automatically merged with no ugly special cases.
> 
> No changes to 2/3 or 3/3.

Thanks for the patches Ahelena! The code looks fine to me but to be honest
I still have one unresolved question so let me think about it loud here for
documentation purposes :). Do we want fsnotify (any filesystem
notification framework like inotify or fanotify) to actually generate
events on FIFOs? FIFOs are virtual objects and are not part of the
filesystem as such (well, the inode itself and the name is), hence
*filesystem* notification framework does not seem like a great fit to watch
for changes or accesses there. And if we say "yes" for FIFOs, then why not
AF_UNIX sockets? Where do we draw the line? And is it all worth the
trouble?

I understand the convenience of inotify working on FIFOs for the "tail -f"
usecase but then wouldn't this better be fixed in tail(1) itself by using
epoll(7) for FIFOs which, as I've noted in my other reply, does not have
the problem that poll(2) has when there are no writers?

Another issue with FIFOs is that they do not have a concept of file
position. For hierarchical storage usecase we are introducing events that
will report file ranges being modified / accessed and officially supporting
FIFOs is one more special case to deal with.

What is supporting your changes is that fsnotify mostly works for FIFOs
already now (normal reads & writes generate notification) so splice not
working could be viewed as an inconsistency. Sockets (although they are
visible in the filesystem) cannot be open so for them the illusion of being
a file is even weaker.

So overall I guess I'm slightly in favor of making fsnotify generate events
on FIFOs even with splice, provided Amir does not see a big trouble in
supporting this with his upcoming HSM changes.

								Honza

> Ahelenia Ziemiańska (3):
>   splice: always fsnotify_access(in), fsnotify_modify(out) on success
>   splice: fsnotify_access(fd)/fsnotify_modify(fd) in vmsplice
>   splice: fsnotify_access(in), fsnotify_modify(out) on success in tee
> 
>  fs/splice.c | 38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> Interdiff against v3:
> diff --git a/fs/splice.c b/fs/splice.c
> index 2ecfccbda956..bdbabc2ebfff 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1184,10 +1184,6 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
>  			out->f_pos = offset;
>  		else
>  			*off_out = offset;
> -
> -		// splice_write-> already marked out
> -		// as modified via vfs_iter_write()
> -		goto noaccessout;
>  	} else if (opipe) {
>  		if (off_out)
>  			return -ESPIPE;
> @@ -1211,11 +1207,10 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
>  	} else
>  		return -EINVAL;
>  
> -	if (ret > 0)
> +	if (ret > 0) {
>  		fsnotify_modify(out);
> -noaccessout:
> -	if (ret > 0)
>  		fsnotify_access(in);
> +	}
>  
>  	return ret;
>  }
> -- 
> 2.39.2

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v4 0/3] fanotify accounting for fs/splice.c

Posted by Amir Goldstein 2 years, 7 months ago

On Wed, Jun 28, 2023 at 2:38 PM Jan Kara <jack@suse.cz> wrote:
>
> Hello!
>
> On Tue 27-06-23 22:50:46, Ahelenia Ziemiańska wrote:
> > Always generate modify out, access in for splice;
> > this gets automatically merged with no ugly special cases.
> >
> > No changes to 2/3 or 3/3.
>
> Thanks for the patches Ahelena! The code looks fine to me but to be honest
> I still have one unresolved question so let me think about it loud here for
> documentation purposes :). Do we want fsnotify (any filesystem
> notification framework like inotify or fanotify) to actually generate
> events on FIFOs? FIFOs are virtual objects and are not part of the
> filesystem as such (well, the inode itself and the name is), hence
> *filesystem* notification framework does not seem like a great fit to watch
> for changes or accesses there. And if we say "yes" for FIFOs, then why not
> AF_UNIX sockets? Where do we draw the line? And is it all worth the
> trouble?
>
> I understand the convenience of inotify working on FIFOs for the "tail -f"
> usecase but then wouldn't this better be fixed in tail(1) itself by using
> epoll(7) for FIFOs which, as I've noted in my other reply, does not have
> the problem that poll(2) has when there are no writers?
>
> Another issue with FIFOs is that they do not have a concept of file
> position. For hierarchical storage usecase we are introducing events that
> will report file ranges being modified / accessed and officially supporting
> FIFOs is one more special case to deal with.
>
> What is supporting your changes is that fsnotify mostly works for FIFOs
> already now (normal reads & writes generate notification) so splice not
> working could be viewed as an inconsistency. Sockets (although they are
> visible in the filesystem) cannot be open so for them the illusion of being
> a file is even weaker.
>
> So overall I guess I'm slightly in favor of making fsnotify generate events
> on FIFOs even with splice, provided Amir does not see a big trouble in
> supporting this with his upcoming HSM changes.
>

I've also thought about this.

The thing about the HSM events is that they are permission events
and just like FAN_ACCESS_PERM, they originate from the common
access control helpers {rw,remap}_verify_area(), which also happen
to have the file range info (with ppos NULL for pipes).

Ahelenia's patches do not add any new rw_verify_area() to pipes
so no new FAN_ACCESS_PERM events were added.

If we could go back to the design of fanotify we would have probably
made it explicit that permission events are only allowed on regular
files and dirs. For the new HSM events we can (and will) do that.

In any case, the new events are supposed to be delivered with
file access range records, so delivering HSM events on pipes
wouldn't make any sense.

So I do not see any problem with these patches wrt upcomping
HSM events.

However, note that these patches create more inconsistencies
between IN_ACCESS and FAN_ACCESS_PERM on pipes.

We can leave it at that if we want, but fixing the inconsistencies
by adding more FAN_ACCESS_PERM events on pipes - this
is not something that I wouldn't be comfortable with.

If anything, we can remove FAN_ACCESS_PERM events from
special files and see if anybody complains.

I don't know of any users of FAN_ACCESS_PERM and even for
FAN_OPEN_PERM, I don't think that AV-vendors have anything
useful to do with open permission events on special files.

Thanks,
Amir.

Re: [PATCH v4 0/3] fanotify accounting for fs/splice.c

Posted by Ahelenia Ziemiańska 2 years, 7 months ago

Hi!

On Wed, Jun 28, 2023 at 01:38:53PM +0200, Jan Kara wrote:
> On Tue 27-06-23 22:50:46, Ahelenia Ziemiańska wrote:
> > Always generate modify out, access in for splice;
> > this gets automatically merged with no ugly special cases.
> > 
> > No changes to 2/3 or 3/3.
> Thanks for the patches Ahelena! The code looks fine to me but to be honest
> I still have one unresolved question so let me think about it loud here for
> documentation purposes :). Do we want fsnotify (any filesystem
> notification framework like inotify or fanotify) to actually generate
> events on FIFOs? FIFOs are virtual objects and are not part of the
> filesystem as such (well, the inode itself and the name is), hence
> *filesystem* notification framework does not seem like a great fit to watch
> for changes or accesses there. And if we say "yes" for FIFOs, then why not
> AF_UNIX sockets? Where do we draw the line? And is it all worth the
> trouble?
As a relative outsider (I haven't used inotify before this, and have not
 been subjected to it or its peripheries before),
I interpreted inotify as being the Correct solution for:
  1. stuff you can find in a normal
     (non-/dev, you don't want to touch devices)
     filesystem traversal
  2. stuff you can open
where, going down the list in inode(7):
  S_IFSOCK   can't open
  S_IFLNK    can't open
  S_IFREG    yes!
  S_IFBLK    it's a device
  S_IFDIR    yes!
  S_IFCHR    it's a device
  S_IFIFO    yes!

It appears that I'm not the only one who's interpreted it that way,
especially since neither regular files nor pipes are pollable.
(Though, under that same categorisation, I wouldn't be surprised
 if anonymous pipes had been refused, for example, since those are
 conventionally unnameable.)

To this end, I'd say we're leaving the line precisely where it was drawn
before, even if by accident.

> I understand the convenience of inotify working on FIFOs for the "tail -f"
> usecase but then wouldn't this better be fixed in tail(1) itself by using
> epoll(7) for FIFOs which, as I've noted in my other reply, does not have
> the problem that poll(2) has when there are no writers?
Yes, epoll in ET mode returns POLLHUP only once, but you /also/ need the
inotify anyway for regular files, which epoll refuses
(and, with -F, you may want both epoll for a pipe and inotify for the
 directory it's contained in).
Is it possible to do? yes. Is it more annoying than just having pipes
report when they were written to? very much so.

inotify actually working(*) is presumably why coreutils tail doesn't use
epoll ‒ inotify already provides all required events(*), you can use the
same code for regular files and fifos, and with one fewer level of
indirection: there's just no need(*).

(*: except with a magic syscall only I use apparently)

Re: [PATCH v4 0/3] fanotify accounting for fs/splice.c

Posted by Jan Kara 2 years, 7 months ago

Hi!

On Wed 28-06-23 20:54:28, Ahelenia Ziemiańska wrote:
> On Wed, Jun 28, 2023 at 01:38:53PM +0200, Jan Kara wrote:
> > On Tue 27-06-23 22:50:46, Ahelenia Ziemiańska wrote:
> > > Always generate modify out, access in for splice;
> > > this gets automatically merged with no ugly special cases.
> > > 
> > > No changes to 2/3 or 3/3.
> > Thanks for the patches Ahelena! The code looks fine to me but to be honest
> > I still have one unresolved question so let me think about it loud here for
> > documentation purposes :). Do we want fsnotify (any filesystem
> > notification framework like inotify or fanotify) to actually generate
> > events on FIFOs? FIFOs are virtual objects and are not part of the
> > filesystem as such (well, the inode itself and the name is), hence
> > *filesystem* notification framework does not seem like a great fit to watch
> > for changes or accesses there. And if we say "yes" for FIFOs, then why not
> > AF_UNIX sockets? Where do we draw the line? And is it all worth the
> > trouble?
> As a relative outsider (I haven't used inotify before this, and have not
>  been subjected to it or its peripheries before),
> I interpreted inotify as being the Correct solution for:
>   1. stuff you can find in a normal
>      (non-/dev, you don't want to touch devices)
>      filesystem traversal
>   2. stuff you can open
> where, going down the list in inode(7):
>   S_IFSOCK   can't open
>   S_IFLNK    can't open
>   S_IFREG    yes!
>   S_IFBLK    it's a device
>   S_IFDIR    yes!
>   S_IFCHR    it's a device
>   S_IFIFO    yes!
> 
> It appears that I'm not the only one who's interpreted it that way,
> especially since neither regular files nor pipes are pollable.
> (Though, under that same categorisation, I wouldn't be surprised
>  if anonymous pipes had been refused, for example, since those are
>  conventionally unnameable.)
> 
> To this end, I'd say we're leaving the line precisely where it was drawn
> before, even if by accident.

I agree, although I'd note that there are S_IFREG inodes under /sys or
/proc where it would be too difficult to provide fsnotify events (exactly
because the file contents is not "data stored somewhere" but rather
something "generated on the fly") so the illusion is not perfect already.

> > I understand the convenience of inotify working on FIFOs for the "tail -f"
> > usecase but then wouldn't this better be fixed in tail(1) itself by using
> > epoll(7) for FIFOs which, as I've noted in my other reply, does not have
> > the problem that poll(2) has when there are no writers?
> Yes, epoll in ET mode returns POLLHUP only once, but you /also/ need the
> inotify anyway for regular files, which epoll refuses
> (and, with -F, you may want both epoll for a pipe and inotify for the
>  directory it's contained in).
> Is it possible to do? yes. Is it more annoying than just having pipes
> report when they were written to? very much so.
> 
> inotify actually working(*) is presumably why coreutils tail doesn't use
> epoll ‒ inotify already provides all required events(*), you can use the
> same code for regular files and fifos, and with one fewer level of
> indirection: there's just no need(*).
> 
> (*: except with a magic syscall only I use apparently)

Yeah, I've slept to this and I still think adding fsnotify events to splice
is a nicer option so feel free to add:

Acked-by: Jan Kara <jack@suse.cz>

to all kernel patches in your series. Since the changes are in splice code,
Christian or Al Viro (who you already have on CC list) should be merging
this so please make sure to also include them in the v5 submission.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

Re: [PATCH v4 0/3] fanotify accounting for fs/splice.c

Posted by Amir Goldstein 2 years, 7 months ago

On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska
<nabijaczleweli@nabijaczleweli.xyz> wrote:
>
> Always generate modify out, access in for splice;
> this gets automatically merged with no ugly special cases.

Ahelenia,

Obviously, you are new to sending patches to the kernel
and you appear to be a very enthusiastic and fast learner,
so I assume you won't mind getting some tips that you won't
find in any document.

a) CC LTP only on tests, not on the kernel patches

b) Please don't post these "diff" patches.
Developers (and bots) should be able to understand which
upstream commit a patch is based on (git format-patch provides that info).

I mean you can send those diff patches as part of a conversation to
explain yourself, that's fine, just don't post them as if they are patches
for review.

c) When there are prospect reviewers that have not reviewed v1
(especially inotify maintainer), it is better to wait at least one day posting
v2 and v3 and v4 ;), because:
1. It is better to accumulate review comments from several reviewers
2. Different reviewers may disagree, so if you are just following my
    advice you may need to go back and forth until everyone is happy
3. It's racy - reviewers may be in the middle of review of v1 without
    realizing that v2,v3,v4 is already in their inbox, so that's creating
    extra work for them - not a good outcome

Going to review v4....

Thanks,
Amir.

[PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Posted by Ahelenia Ziemiańska 2 years, 7 months ago

The current behaviour caused an asymmetry where some write APIs
(write, sendfile) would notify the written-to/read-from objects,
but splice wouldn't.

This affected userspace which uses inotify, most notably coreutils
tail -f, to monitor pipes.
If the pipe buffer had been filled by a splice-family function:
  * tail wouldn't know and thus wouldn't service the pipe, and
  * all writes to the pipe would block because it's full,
thus service was denied.
(For the particular case of tail -f this could be worked around
 with ---disable-inotify.)

Generate modify out before access in to let inotify merge the
modify out events in thr ipipe case.

Fixes: 983652c69199 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/splice.c | 26 +++++++++-----------------
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 3e06611d19ae..b5c7a5ae0e94 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1154,10 +1154,8 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 		if ((in->f_flags | out->f_flags) & O_NONBLOCK)
 			flags |= SPLICE_F_NONBLOCK;
 
-		return splice_pipe_to_pipe(ipipe, opipe, len, flags);
-	}
-
-	if (ipipe) {
+		ret = splice_pipe_to_pipe(ipipe, opipe, len, flags);
+	} else if (ipipe) {
 		if (off_in)
 			return -ESPIPE;
 		if (off_out) {
@@ -1182,18 +1180,11 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 		ret = do_splice_from(ipipe, out, &offset, len, flags);
 		file_end_write(out);
 
-		if (ret > 0)
-			fsnotify_modify(out);
-
 		if (!off_out)
 			out->f_pos = offset;
 		else
 			*off_out = offset;
-
-		return ret;
-	}
-
-	if (opipe) {
+	} else if (opipe) {
 		if (off_out)
 			return -ESPIPE;
 		if (off_in) {
@@ -1209,18 +1200,19 @@ long do_splice(struct file *in, loff_t *off_in, struct file *out,
 
 		ret = splice_file_to_pipe(in, opipe, &offset, len, flags);
 
-		if (ret > 0)
-			fsnotify_access(in);
-
 		if (!off_in)
 			in->f_pos = offset;
 		else
 			*off_in = offset;
+	} else
+		return -EINVAL;
 
-		return ret;
+	if (ret > 0) {
+		fsnotify_modify(out);
+		fsnotify_access(in);
 	}
 
-	return -EINVAL;
+	return ret;
 }
 
 static long __do_splice(struct file *in, loff_t __user *off_in,
-- 
2.39.2

Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Posted by Amir Goldstein 2 years, 7 months ago

On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska
<nabijaczleweli@nabijaczleweli.xyz> wrote:
>
> The current behaviour caused an asymmetry where some write APIs
> (write, sendfile) would notify the written-to/read-from objects,
> but splice wouldn't.
>
> This affected userspace which uses inotify, most notably coreutils
> tail -f, to monitor pipes.
> If the pipe buffer had been filled by a splice-family function:
>   * tail wouldn't know and thus wouldn't service the pipe, and
>   * all writes to the pipe would block because it's full,
> thus service was denied.
> (For the particular case of tail -f this could be worked around
>  with ---disable-inotify.)
>

Is my understanding of the tail code wrong?
My understanding was that tail_forever_inotify() is not called for
pipes, or is it being called when tailing a mixed collection of pipes
and regular files? If there are subtleties like those you need to
mention them , otherwise people will not be able to reproduce the
problem that you are describing.

I need to warn you about something regarding this patch -
often there are colliding interests among different kernel users -
fsnotify use cases quite often collide with the interest of users tracking
performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes
specifically have been the source of several performance regression reports
in the past and have driven optimizations like:

71d734103edf ("fsnotify: Rearrange fast path to minimise overhead
when there is no watcher")
e43de7f0862b ("fsnotify: optimize the case of no marks of any type")

The moral of this story is: even if your patches are accepted by fsnotify
reviewers, once they are staged for merging they will be subject to
performance regression tests and I can tell you with certainty that
performance regression will not be tolerated for the tail -f use case.
I will push your v4 patches to a branch in my github, to let the kernel
test bots run the performance regressions on it whenever they get to it.

Moreover, if coreutils will change tail -f to start setting inotify watches
on anonymous pipes (my understanding is that currently does not?),
then any tail -f on anonymous pipe can cripple the "no marks on sb"
performance optimization for all anonymous pipes and that would be
a *very* unfortunate outcome.

I think we need to add a rule to fanotify_events_supported() to ban
sb/mount marks on SB_KERNMOUNT and backport this
fix to LTS kernels (I will look into it) and then we can fine tune
the s_fsnotify_connectors optimization in fsnotify_parent() for
the SB_KERNMOUNT special case.
This may be able to save your patch for the faith of NACKed
for performance regression.

> Generate modify out before access in to let inotify merge the
> modify out events in thr ipipe case.

This comment is not clear and does not belong in this context,
but it very much belongs near the code in question.

Please wait to collect more feedback and specifically
to hear what Jan has to say about this hack before posting v5!

FYI, we are now in the beginning of the 6.5 "merge window",
which means that maintainers may be less responsive for the
next two weeks to non-critical patches as this one, which are
not targeted for the 6.5 kernel release.

Thanks,
Amir.

Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Posted by Ahelenia Ziemiańska 2 years, 7 months ago

On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote:
> On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska
> <nabijaczleweli@nabijaczleweli.xyz> wrote:
> > The current behaviour caused an asymmetry where some write APIs
> > (write, sendfile) would notify the written-to/read-from objects,
> > but splice wouldn't.
> >
> > This affected userspace which uses inotify, most notably coreutils
> > tail -f, to monitor pipes.
> > If the pipe buffer had been filled by a splice-family function:
> >   * tail wouldn't know and thus wouldn't service the pipe, and
> >   * all writes to the pipe would block because it's full,
> > thus service was denied.
> > (For the particular case of tail -f this could be worked around
> >  with ---disable-inotify.)
> Is my understanding of the tail code wrong?
> My understanding was that tail_forever_inotify() is not called for
> pipes, or is it being called when tailing a mixed collection of pipes
> and regular files? If there are subtleties like those you need to
> mention them , otherwise people will not be able to reproduce the
> problem that you are describing.
I can't squeak to the code itself, but it's trivial to check:
  $ tail -f  fifo &
  [1] 3213996
  $ echo zupa > fifo
  zupa
  $ echo zupa > fifo
  zupa
  $ echo zupa > fifo
  zupa
  $ cat /bin/tail > fifo
  # ...
  $ cat /bin/tail > fifo
hangs: the fifo is being watched with inotify.

This happens regardless of other files being specified.

tail -f doesn't follow FIFOs or pipes if they're fd 0
(guaranteed by POSIX, coreutils conforms).

OTOH, you could theoretically do
  $ cat | tail -f /dev/fd/3 3<&0
which first reads from the pipe until completion (⇔ hangup, cat died),
then hangs, because it's waiting for more data on the pipe.

This can never happen under a normal scenario, but doing
  $ echo zupa > /proc/3238590/fd/3
a few times reveals it's using classic 1/s polling
(and splicing to /proc/3238590/fd/3 actually yields that data being
 output from tail).

> I need to warn you about something regarding this patch -
> often there are colliding interests among different kernel users -
> fsnotify use cases quite often collide with the interest of users tracking
> performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes
> specifically have been the source of several performance regression reports
> in the past and have driven optimizations like:
> 
> 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead
> when there is no watcher")
> e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> 
> The moral of this story is: even if your patches are accepted by fsnotify
> reviewers, once they are staged for merging they will be subject to
> performance regression tests and I can tell you with certainty that
> performance regression will not be tolerated for the tail -f use case.
> I will push your v4 patches to a branch in my github, to let the kernel
> test bots run the performance regressions on it whenever they get to it.
> 
> Moreover, if coreutils will change tail -f to start setting inotify watches
> on anonymous pipes (my understanding is that currently does not?),
> then any tail -f on anonymous pipe can cripple the "no marks on sb"
> performance optimization for all anonymous pipes and that would be
> a *very* unfortunate outcome.
As seen above, it doesn't set inotify watches on anon pipes, and
(since it manages to distinguish "| /dev/fd/3 3<&0" from "fifo",
 so it must be going further than S_ISFIFO(fstat()))
this is an explicit design decision.

If you refuse setting inotifies on anon pipes then that likely won't
impact any userspace program (it's pathological, and for tail-like cases
 it'd only be meaningful for magic /proc/$pid/fd/* symlinks),
and if it's in the name of performance then no-one'll likely complain,
or even notice.

> I think we need to add a rule to fanotify_events_supported() to ban
> sb/mount marks on SB_KERNMOUNT and backport this
> fix to LTS kernels (I will look into it) and then we can fine tune
> the s_fsnotify_connectors optimization in fsnotify_parent() for
> the SB_KERNMOUNT special case.
> This may be able to save your patch for the faith of NACKed
> for performance regression.
This goes over my head, but if Jan says it makes sense
then it must do.

> > Generate modify out before access in to let inotify merge the
> > modify out events in thr ipipe case.
> This comment is not clear and does not belong in this context,
> but it very much belongs near the code in question.
Turned it into
		/*
		 * Generate modify out before access in:
		 * do_splice_from() may've already sent modify out,
		 * and this ensures the events get merged.
		 */
for v5.

Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Posted by Amir Goldstein 2 years, 7 months ago

On Wed, Jun 28, 2023 at 8:09 PM Ahelenia Ziemiańska
<nabijaczleweli@nabijaczleweli.xyz> wrote:
>
> On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote:
> > On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska
> > <nabijaczleweli@nabijaczleweli.xyz> wrote:
> > > The current behaviour caused an asymmetry where some write APIs
> > > (write, sendfile) would notify the written-to/read-from objects,
> > > but splice wouldn't.
> > >
> > > This affected userspace which uses inotify, most notably coreutils
> > > tail -f, to monitor pipes.
> > > If the pipe buffer had been filled by a splice-family function:
> > >   * tail wouldn't know and thus wouldn't service the pipe, and
> > >   * all writes to the pipe would block because it's full,
> > > thus service was denied.
> > > (For the particular case of tail -f this could be worked around
> > >  with ---disable-inotify.)
> > Is my understanding of the tail code wrong?
> > My understanding was that tail_forever_inotify() is not called for
> > pipes, or is it being called when tailing a mixed collection of pipes
> > and regular files? If there are subtleties like those you need to
> > mention them , otherwise people will not be able to reproduce the
> > problem that you are describing.
> I can't squeak to the code itself, but it's trivial to check:
>   $ tail -f  fifo &
>   [1] 3213996
>   $ echo zupa > fifo
>   zupa
>   $ echo zupa > fifo
>   zupa
>   $ echo zupa > fifo
>   zupa
>   $ cat /bin/tail > fifo
>   # ...
>   $ cat /bin/tail > fifo
> hangs: the fifo is being watched with inotify.
>
> This happens regardless of other files being specified.
>
> tail -f doesn't follow FIFOs or pipes if they're fd 0
> (guaranteed by POSIX, coreutils conforms).
>
> OTOH, you could theoretically do
>   $ cat | tail -f /dev/fd/3 3<&0
> which first reads from the pipe until completion (⇔ hangup, cat died),
> then hangs, because it's waiting for more data on the pipe.
>
> This can never happen under a normal scenario, but doing
>   $ echo zupa > /proc/3238590/fd/3
> a few times reveals it's using classic 1/s polling
> (and splicing to /proc/3238590/fd/3 actually yields that data being
>  output from tail).
>
> > I need to warn you about something regarding this patch -
> > often there are colliding interests among different kernel users -
> > fsnotify use cases quite often collide with the interest of users tracking
> > performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes
> > specifically have been the source of several performance regression reports
> > in the past and have driven optimizations like:
> >
> > 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead
> > when there is no watcher")
> > e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> >
> > The moral of this story is: even if your patches are accepted by fsnotify
> > reviewers, once they are staged for merging they will be subject to
> > performance regression tests and I can tell you with certainty that
> > performance regression will not be tolerated for the tail -f use case.
> > I will push your v4 patches to a branch in my github, to let the kernel
> > test bots run the performance regressions on it whenever they get to it.
> >
> > Moreover, if coreutils will change tail -f to start setting inotify watches
> > on anonymous pipes (my understanding is that currently does not?),
> > then any tail -f on anonymous pipe can cripple the "no marks on sb"
> > performance optimization for all anonymous pipes and that would be
> > a *very* unfortunate outcome.
> As seen above, it doesn't set inotify watches on anon pipes, and
> (since it manages to distinguish "| /dev/fd/3 3<&0" from "fifo",
>  so it must be going further than S_ISFIFO(fstat()))
> this is an explicit design decision.
>
> If you refuse setting inotifies on anon pipes then that likely won't
> impact any userspace program (it's pathological, and for tail-like cases
>  it'd only be meaningful for magic /proc/$pid/fd/* symlinks),
> and if it's in the name of performance then no-one'll likely complain,
> or even notice.
>

Unfortunately, it doesn't work this way - most of the time we are not
supposed to break existing applications and I have no way of knowing if
those applications exist...

> > I think we need to add a rule to fanotify_events_supported() to ban
> > sb/mount marks on SB_KERNMOUNT and backport this
> > fix to LTS kernels (I will look into it) and then we can fine tune
> > the s_fsnotify_connectors optimization in fsnotify_parent() for
> > the SB_KERNMOUNT special case.
> > This may be able to save your patch for the faith of NACKed
> > for performance regression.
> This goes over my head, but if Jan says it makes sense
> then it must do.
>

Here you go:
https://github.com/amir73il/linux/commits/fsnotify_pipe

I ended up using SB_NOUSER which is narrower than
SB_KERNMOUNT.

Care to test?
1) Functionally - that I did not break your tests.
2) Optimization - that when one anon pipe has an inotify watch
write to another anon pipe stops at fsnotify_inode_has_watchers()
and does not get to fsnotify().

> > > Generate modify out before access in to let inotify merge the
> > > modify out events in thr ipipe case.
> > This comment is not clear and does not belong in this context,
> > but it very much belongs near the code in question.
> Turned it into
>                 /*
>                  * Generate modify out before access in:
>                  * do_splice_from() may've already sent modify out,
>                  * and this ensures the events get merged.
>                  */
> for v5.

OK.

Thanks,
Amir.

Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Posted by Ahelenia Ziemiańska 2 years, 7 months ago

On Wed, Jun 28, 2023 at 09:38:03PM +0300, Amir Goldstein wrote:
> On Wed, Jun 28, 2023 at 8:09 PM Ahelenia Ziemiańska
> <nabijaczleweli@nabijaczleweli.xyz> wrote:
> > On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote:
> > > I think we need to add a rule to fanotify_events_supported() to ban
> > > sb/mount marks on SB_KERNMOUNT and backport this
> > > fix to LTS kernels (I will look into it) and then we can fine tune
> > > the s_fsnotify_connectors optimization in fsnotify_parent() for
> > > the SB_KERNMOUNT special case.
> > > This may be able to save your patch for the faith of NACKed
> > > for performance regression.
> > This goes over my head, but if Jan says it makes sense
> > then it must do.
> Here you go:
> https://github.com/amir73il/linux/commits/fsnotify_pipe
> 
> I ended up using SB_NOUSER which is narrower than
> SB_KERNMOUNT.
> 
> Care to test?
> 1) Functionally - that I did not break your tests.
) | gzip -d > inotify13; chmod +x inotify13; exec ./inotify13
tst_test.c:1560: TINFO: Timeout per run is 0h 00m 30s
inotify13.c:260: TINFO: file_to_pipe
inotify13.c:269: TPASS: ок
inotify13.c:260: TINFO: file_to_pipe
inotify13.c:269: TPASS: ок
inotify13.c:260: TINFO: splice_pipe_to_file
inotify13.c:269: TPASS: ок
inotify13.c:260: TINFO: pipe_to_pipe
inotify13.c:269: TPASS: ок
inotify13.c:260: TINFO: pipe_to_pipe
inotify13.c:269: TPASS: ок
inotify13.c:260: TINFO: vmsplice_pipe_to_mem
inotify13.c:269: TPASS: ок
inotify13.c:260: TINFO: vmsplice_mem_to_pipe
inotify13.c:269: TPASS: ок

Summary:
passed   7
failed   0
broken   0
skipped  0
warnings 0

The discrete tests from before also work as expected,
both to a fifo and an anon pipe.

> 2) Optimization - that when one anon pipe has an inotify watch
> write to another anon pipe stops at fsnotify_inode_has_watchers()
> and does not get to fsnotify().
Yes, I can confirm this as well: fsnotify_parent() only continues to
fsnotify() for the watched pipe; writes to other pipes early-exit.

To validate the counterfactual, I reverted "fsnotify: optimize the case
of anonymous pipe with no watches" and fsnotify() was being called
for each anon pipe write, so long as any anon pipe watches were registered.

Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Posted by Amir Goldstein 2 years, 7 months ago

On Wed, Jun 28, 2023 at 11:18 PM Ahelenia Ziemiańska
<nabijaczleweli@nabijaczleweli.xyz> wrote:
>
> On Wed, Jun 28, 2023 at 09:38:03PM +0300, Amir Goldstein wrote:
> > On Wed, Jun 28, 2023 at 8:09 PM Ahelenia Ziemiańska
> > <nabijaczleweli@nabijaczleweli.xyz> wrote:
> > > On Wed, Jun 28, 2023 at 09:33:43AM +0300, Amir Goldstein wrote:
> > > > I think we need to add a rule to fanotify_events_supported() to ban
> > > > sb/mount marks on SB_KERNMOUNT and backport this
> > > > fix to LTS kernels (I will look into it) and then we can fine tune
> > > > the s_fsnotify_connectors optimization in fsnotify_parent() for
> > > > the SB_KERNMOUNT special case.
> > > > This may be able to save your patch for the faith of NACKed
> > > > for performance regression.
> > > This goes over my head, but if Jan says it makes sense
> > > then it must do.
> > Here you go:
> > https://github.com/amir73il/linux/commits/fsnotify_pipe
> >
> > I ended up using SB_NOUSER which is narrower than
> > SB_KERNMOUNT.
> >
> > Care to test?
> > 1) Functionally - that I did not break your tests.
> ) | gzip -d > inotify13; chmod +x inotify13; exec ./inotify13
> tst_test.c:1560: TINFO: Timeout per run is 0h 00m 30s
> inotify13.c:260: TINFO: file_to_pipe
> inotify13.c:269: TPASS: ок
> inotify13.c:260: TINFO: file_to_pipe
> inotify13.c:269: TPASS: ок
> inotify13.c:260: TINFO: splice_pipe_to_file
> inotify13.c:269: TPASS: ок
> inotify13.c:260: TINFO: pipe_to_pipe
> inotify13.c:269: TPASS: ок
> inotify13.c:260: TINFO: pipe_to_pipe
> inotify13.c:269: TPASS: ок
> inotify13.c:260: TINFO: vmsplice_pipe_to_mem
> inotify13.c:269: TPASS: ок
> inotify13.c:260: TINFO: vmsplice_mem_to_pipe
> inotify13.c:269: TPASS: ок
>
> Summary:
> passed   7
> failed   0
> broken   0
> skipped  0
> warnings 0
>
> The discrete tests from before also work as expected,
> both to a fifo and an anon pipe.
>
> > 2) Optimization - that when one anon pipe has an inotify watch
> > write to another anon pipe stops at fsnotify_inode_has_watchers()
> > and does not get to fsnotify().
> Yes, I can confirm this as well: fsnotify_parent() only continues to
> fsnotify() for the watched pipe; writes to other pipes early-exit.
>
> To validate the counterfactual, I reverted "fsnotify: optimize the case
> of anonymous pipe with no watches" and fsnotify() was being called
> for each anon pipe write, so long as any anon pipe watches were registered.

Thank you for testing!

As Jan suggested, when you post v5, with my Reviewed-by and Jan's
Acked-by, please ask Christian to review and consider taking these
patches through the vfs tree for the 6.6 release.

Please include a link to your LTP test in the cover letter and a link to
my performance optimization patches.

Unless the kernel test bots detect a performance regression due to
your patches, I am not sure whether or not or when we will apply the
optimization patches.

Thanks,
Amir.

Re: [PATCH v4 1/3] splice: always fsnotify_access(in), fsnotify_modify(out) on success

Posted by Jan Kara 2 years, 7 months ago

On Wed 28-06-23 09:33:43, Amir Goldstein wrote:
> On Tue, Jun 27, 2023 at 11:50 PM Ahelenia Ziemiańska
> <nabijaczleweli@nabijaczleweli.xyz> wrote:
> >
> > The current behaviour caused an asymmetry where some write APIs
> > (write, sendfile) would notify the written-to/read-from objects,
> > but splice wouldn't.
> >
> > This affected userspace which uses inotify, most notably coreutils
> > tail -f, to monitor pipes.
> > If the pipe buffer had been filled by a splice-family function:
> >   * tail wouldn't know and thus wouldn't service the pipe, and
> >   * all writes to the pipe would block because it's full,
> > thus service was denied.
> > (For the particular case of tail -f this could be worked around
> >  with ---disable-inotify.)
> >
> 
> Is my understanding of the tail code wrong?
> My understanding was that tail_forever_inotify() is not called for
> pipes, or is it being called when tailing a mixed collection of pipes
> and regular files? If there are subtleties like those you need to
> mention them , otherwise people will not be able to reproduce the
> problem that you are describing.

Well, on my openSUSE 15.4 at least, tail -f does use inotify on FIFOs and
indeed when data is spliced to the FIFO, tail doesn't notice.

> I need to warn you about something regarding this patch -
> often there are colliding interests among different kernel users -
> fsnotify use cases quite often collide with the interest of users tracking
> performance regressions and IN_ACCESS/IN_MODIFY on anonymous pipes
> specifically have been the source of several performance regression reports
> in the past and have driven optimizations like:
> 
> 71d734103edf ("fsnotify: Rearrange fast path to minimise overhead
> when there is no watcher")
> e43de7f0862b ("fsnotify: optimize the case of no marks of any type")
> 
> The moral of this story is: even if your patches are accepted by fsnotify
> reviewers, once they are staged for merging they will be subject to
> performance regression tests and I can tell you with certainty that
> performance regression will not be tolerated for the tail -f use case.
> I will push your v4 patches to a branch in my github, to let the kernel
> test bots run the performance regressions on it whenever they get to it.
> 
> Moreover, if coreutils will change tail -f to start setting inotify watches
> on anonymous pipes (my understanding is that currently does not?),
> then any tail -f on anonymous pipe can cripple the "no marks on sb"
> performance optimization for all anonymous pipes and that would be
> a *very* unfortunate outcome.

Do you mean the "s_fsnotify_connectors" check? Yeah, a fsnotify watch on
any pipe inode is going to somewhat slow down the fsnotify calls for any
pipe. OTOH I don't expect inotify watches on pipe inodes to be common and
it is not like the overhead is huge. Also nobody really prevents you from
placing watch on pipe inode now with similar consequences, this patch only
makes it actually working with splice. So I'm not worried about the
performance impact. At least until somebody comes with a realistic
complaint ;-).

> I think we need to add a rule to fanotify_events_supported() to ban
> sb/mount marks on SB_KERNMOUNT and backport this
> fix to LTS kernels (I will look into it) and then we can fine tune
> the s_fsnotify_connectors optimization in fsnotify_parent() for
> the SB_KERNMOUNT special case.

Yeah, probably makes sense.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

[PATCH v4 2/3] splice: fsnotify_access(fd)/fsnotify_modify(fd) in vmsplice

Posted by Ahelenia Ziemiańska 2 years, 7 months ago

Same logic applies here: this can fill up the pipe and pollers that rely
on getting IN_MODIFY notifications never wake up.

Fixes: 983652c69199 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/splice.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index b5c7a5ae0e94..d3a7f4d5c078 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1341,6 +1341,9 @@ static long vmsplice_to_user(struct file *file, struct iov_iter *iter,
 		pipe_unlock(pipe);
 	}
 
+	if (ret > 0)
+		fsnotify_access(file);
+
 	return ret;
 }
 
@@ -1370,8 +1373,10 @@ static long vmsplice_to_pipe(struct file *file, struct iov_iter *iter,
 	if (!ret)
 		ret = iter_to_pipe(iter, pipe, buf_flag);
 	pipe_unlock(pipe);
-	if (ret > 0)
+	if (ret > 0) {
 		wakeup_pipe_readers(pipe);
+		fsnotify_modify(file);
+	}
 	return ret;
 }
 
-- 
2.39.2

[PATCH v4 3/3] splice: fsnotify_access(in), fsnotify_modify(out) on success in tee

Posted by Ahelenia Ziemiańska 2 years, 7 months ago

Same logic applies here: this can fill up the pipe, and pollers that rely
on getting IN_MODIFY notifications never wake up.

Fixes: 983652c69199 ("splice: report related fsnotify events")
Link: https://lore.kernel.org/linux-fsdevel/jbyihkyk5dtaohdwjyivambb2gffyjs3dodpofafnkkunxq7bu@jngkdxx65pux/t/#u
Link: https://bugs.debian.org/1039488
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
---
 fs/splice.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index d3a7f4d5c078..bdbabc2ebfff 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1810,6 +1810,11 @@ long do_tee(struct file *in, struct file *out, size_t len, unsigned int flags)
 		}
 	}
 
+	if (ret > 0) {
+		fsnotify_access(in);
+		fsnotify_modify(out);
+	}
+
 	return ret;
 }
 
-- 
2.39.2