vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

[PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 1 week, 1 day ago

This patchset is for VFS.

Recently we got a lot of vulnerabilities in splice/vmsplice.

vmsplice has also been a source of vulnerabilities in the past:
CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).

Also vmsplice is problematic for other reasons. Here is what other
developers say:

Linus Torvalds in 2023:
> So I'd personally be perfectly ok with just making vmsplice() be
> exactly the same as write, and turn all of vmsplice() into just "it's
> a read() if the pipe is open for read, and a write if it's open for
> writing".
https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/

Christoph Hellwig in May 2026:
> vmsplice is the worst, as it is one of the few remaining places that
> can incorrectly dirty file backed pages without telling the file system
> and cause the other problems fixed by a FOLL_PIN conversion, but it is
> the only one where we do not have any idea yet how we could convert it
> to FOLL_PIN due to the unbounded pin time.
https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/

See recent discussion here:
https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u

For all these reasons I propose to make vmsplice a simple wrapper for
preadv2/pwritev2.

vmsplice(fd, vec, vlen, vmsplice_flags) will
be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if the pipe is
open for reading and to pwritev2(fd, vec, vlen, -1, rw_flags) if the
pipe is open for writing.

SPLICE_F_NONBLOCK is translated to RWF_NOWAIT, all other SPLICE_F_*
flags are ignored.
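
As a rough userspace illustration of the mapping described above (a sketch
under assumptions, not the code from the patchset): the helper name
vmsplice_as_rw is hypothetical, the pipe direction is inferred from the file
access mode via fcntl(F_GETFL), and there is no check that fd is actually a
pipe. The fallback #defines mirror the Linux UAPI values in case the libc
headers do not provide them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008        /* value from the Linux UAPI headers */
#endif
#ifndef SPLICE_F_NONBLOCK
#define SPLICE_F_NONBLOCK 0x02       /* value from the Linux UAPI headers */
#endif

/* Hypothetical helper: roughly what vmsplice() amounts to under the proposal. */
static ssize_t vmsplice_as_rw(int fd, const struct iovec *vec, int vlen,
                              unsigned int splice_flags)
{
	/* Only SPLICE_F_NONBLOCK is translated; other SPLICE_F_* bits are ignored. */
	int rw_flags = (splice_flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0;
	int fl = fcntl(fd, F_GETFL);

	if (fl < 0)
		return -1;

	switch (fl & O_ACCMODE) {
	case O_RDONLY:	/* read end of the pipe: fill the user buffers */
		return preadv2(fd, vec, vlen, -1, rw_flags);
	case O_WRONLY:	/* write end of the pipe: copy the user buffers in */
		return pwritev2(fd, vec, vlen, -1, rw_flags);
	default:	/* assumption: anything else is simply rejected here */
		errno = EBADF;
		return -1;
	}
}
```

With splice_flags == 0 the helper behaves like plain readv()/writev() on the
pipe end it is given.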

There is a small change to handling of NONBLOCK-related flags,
see commit messages for details.

I tested this patch in Qemu.

This patchset was written by me, not by LLMs.

Askar Safin (3):
  tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
  vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
  splice: remove PIPE_BUF_FLAG_GIFT

 fs/fuse/dev.c             |   1 -
 fs/read_write.c           |  23 +++++
 fs/splice.c               | 202 +-------------------------------------
 include/linux/pipe_fs_i.h |   1 -
 include/linux/skbuff.h    |   4 +-
 include/linux/splice.h    |   2 +-
 include/linux/syscalls.h  |   4 +-
 7 files changed, 33 insertions(+), 204 deletions(-)


base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d (7.1-rc5)
-- 
2.47.3

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Pedro Falcato 1 week ago

On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> This patchset is for VFS.
> 
> Recently we got a lot of vulnerabilities in splice/vmsplice.
> 
> Also vmsplice already was source of vulnerabilities in the past:
> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
> 
> Also vmsplice is problematic for other reasons. Here is what other
> developers say:
> 
> Linus Torvalds in 2023:
> > So I'd personally be perfectly ok with just making vmsplice() be
> > exactly the same as write, and turn all of vmsplice() into just "it's
> > a read() if the pipe is open for read, and a write if it's open for
> > writing".
> https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
> 
> Christoph Hellwig in May 2026:
> > vmsplice is the worst, as it is one of the few remaining places that
> > can incorrectly dirty file backed pages without telling the file system
> > and cause the other problems fixed by a FOLL_PIN conversion, but it is
> > the only one where we do not have any idea yet how we could convert it
> > to FOLL_PIN due to the unbounded pin time.
> https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/
> 
> See recent discussion here:
> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u

So, you took an ongoing discussion with an ongoing RFC patchset, and you
decided to reimplement part of the idea on your own, as a concurrent patchset.

Riiiiiight.... I don't think I have to NAK this, do I?

> 
> For all these reasons I propose to make vmsplice a simple wrapper for
> preadv2/pwritev2.
> 
> vmsplice(fd, vec, vlen, vmsplice_flags) will
> be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
> readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
> writable pipe.

This does not work. https://codesearch.debian.net/search?q=vmsplice%28&literal=1
There are users.

-- 
Pedro

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 5 days, 10 hours ago

Pedro Falcato <pfalcato@suse.de>:
> On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> > See recent discussion here:
> > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> 
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Riiiiiight.... I don't think I have to NAK this, do I?

Okay, possibly this was indeed inappropriate.

So this time I'm asking explicitly: is it okay to post new patchset?

I want to post patchset, which will remove pagecache-to-pipe splice.

-- 
Askar Safin

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Pedro Falcato 5 days, 9 hours ago

On Wed, Jun 03, 2026 at 12:12:42AM +0300, Askar Safin wrote:
> Pedro Falcato <pfalcato@suse.de>:
> > On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> > > See recent discussion here:
> > > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> > 
> > So, you took an ongoing discussion with an ongoing RFC patchset, and you
> > decided to reimplement part of the idea on your own, as a concurrent patchset.
> > 
> > Riiiiiight.... I don't think I have to NAK this, do I?
> 
> Okay, possibly this was indeed inappropriate.
> 
> So this time I'm asking explicitly: is it okay to post new patchset?
> 
> I want to post patchset, which will remove pagecache-to-pipe splice.

Well, that's most definitely part of my patch. Also, you cannot outright
remove splice() functionality, it's pretty important (besides people doing
funky pipe business, it can also be used for stuff like "take these pages that
we just got on a socket, put them on a pipe and then ship them off to an
actual file" with minimal copying; doing stuff like sendfile() also uses
splice() internally).

So, I guess I'll be sending the v2 soon.

-- 
Pedro

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 5 days, 9 hours ago

On Tue, 2 Jun 2026 at 14:37, Pedro Falcato <pfalcato@suse.de> wrote:
>
> Well, that's most definitely part of my patch. Also, you cannot outright
> remove splice() functionality

That isn't what Askar's patch ever did.

You apparently didn't even read it.

Honestly, I think you are the one out of line here.

Askar did something I suggested years ago, and didn't remove any functionality.

It just changes vmsplice to be a copying model (one of the directions
already was). It doesn't change regular splice at all.

And yes, it has the potential to be a visible behavior difference - if
some insane user uses vmsplice and then modifies the buffer
*afterwards*, then that would be semantically different between a
zero-copy and a normal copy.

But that would be insane behavior, and was never really reliable
anyway even with zero-copy (ie subsequent writes to user space buffers
would potentially do COW breaking based purely on timing and memory
pressure etc, so anybody who relied on it being visible wasn't going
to get it reliably anyway)
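
A minimal sketch of the copying model, assuming an ordinary pipe and plain
writev() (which always copies): the bytes in the pipe are a snapshot taken at
write time, so modifying the buffer afterwards cannot change what the reader
sees. The helper name pipe_keeps_snapshot is made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Returns 1 if the reader sees the bytes as they were at write time,
 * 0 if it sees the later modification, -1 on error.
 */
static int pipe_keeps_snapshot(void)
{
	int p[2];
	char buf[] = "before";
	char out[sizeof buf] = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };
	int ret = -1;

	if (pipe(p) != 0)
		return -1;

	/* Copying write: the pipe gets its own copy of buf. */
	if (writev(p[1], &iov, 1) == (ssize_t)sizeof buf) {
		/* Modify the user buffer *after* the write has returned. */
		memcpy(buf, "AFTER!", sizeof buf);

		if (read(p[0], out, sizeof out) == (ssize_t)sizeof out)
			ret = (memcmp(out, "before", sizeof out) == 0);
	}

	close(p[0]);
	close(p[1]);
	return ret;
}
```

With a zero-copy vmsplice() in place of writev(), the result of the same
sequence would depend on timing, which is the unreliability described above.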

Perhaps more importantly, it has the potential to change performance -
zero-copy *can* be a performance win, although typically it really
doesn't tend to be (looking up the page mapping is often slower than
copying).

I would expect it to be very clear in trivial benchmarks that aren't
actually real loads. And probably not visible anywhere else.

But your responses have been making it clear that you didn't seem to
actually look at the patch or the history of it.

Trying to make it look like Askar is the problem is only making you look worse.

Anyway, the vmsplice() thing is queued up in Christian's tree, and I
guess we'll see if anybody even notices anything.

              Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 5 days, 8 hours ago

Linus Torvalds <torvalds@linux-foundation.org>:
> That isn't what Askar's patch ever did.
> 
> You apparently didn't even read it.
> 
> Honestly, I think you are the one out of line here.
> 
> Askar did something I suggested years ago, and didn't remove any functionality.
> 
> It just changes vmsplice to be a copying model (one of the directions
> already was). It doesn't change regular splice at all.

Pedro is talking here not about this vmsplice patch, but about
my future hypothetical patch, which will remove splice-pagecache-to-pipe.

Let me clarify, what I want to send: I will make splice-pagecache-to-pipe
be a copy. I. e. this splice direction will continue to work, but will be
possibly slower. I. e. I will do something like this (see end of this email)
(absolutely not tested), and the same thing for other filesystems,
and also I will remove resulting dead code and remove
pipe_buf_operations::confirm (it will likely become unneeded).

If Pedro sends this instead, this will be okay.

diff --git i/fs/ext2/file.c w/fs/ext2/file.c
index d9b1eb34694a..8edcc3769793 100644
--- i/fs/ext2/file.c
+++ w/fs/ext2/file.c
@@ -326,7 +326,7 @@ const struct file_operations ext2_file_operations = {
        .release        = ext2_release_file,
        .fsync          = ext2_fsync,
        .get_unmapped_area = thp_get_unmapped_area,
-       .splice_read    = filemap_splice_read,
+       .splice_read    = copy_splice_read,
        .splice_write   = iter_file_splice_write,
        .setlease       = generic_setlease,
 };

-- 
Askar Safin

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 5 days, 7 hours ago

On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
>
> Pedro is talking here not about this vmsplice patch, but about
> my future hypothetical patch, which will remove splice-pagecache-to-pipe.

That absolutely would be my suggested next step.

Something like the attached - get rid of filemap_splice_read()
entirely, and just replace it with copy_splice_read().

That also makes the whole O_DIRECT and DAX special case just simply go away.

This is - in case there was any question about it - ENTIRELY untested.

It may not compile.

And if it does compile, it may do unspeakable things to your pets.

So think of this as nothing more than a "something like this". It does
leave "splice_read" around, and it intentionally just does that

   #define filemap_splice_read copy_splice_read

to not have to modify all the existing users one by one.

It would be interesting to hear if there are any actual real loads
that would ever notice?

                Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 5 days, 3 hours ago

On Tue, Jun 2, 2026 at 5:12 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
> >
> > Pedro is talking here not about this vmsplice patch, but about
> > my future hypothetical patch, which will remove splice-pagecache-to-pipe.
>
> That absolutely would be my suggested next step.
>
> Something like the attached - get rid of filemap_splice_read()
> entirely, and just replace it with copy_splice_read().

Am I understanding correctly that this will completely break zerocopy
sendfile?  sendfile is, internally, splice-to-a-secret-per-task-pipe
and then splice to the socket.  How much do people care?  These days,
a lot of high-bandwidth network senders are sending encrypted data,
which is not zerocopy from pagecache.  But there are surely some users
that care, for example the person who went to the effort to implement
IORING_OP_SPLICE:

commit 7d67af2c013402537385dae343a2d0f6a4cb3bfd
Author: Pavel Begunkov <asml.silence@gmail.com>
Date:   Mon Feb 24 11:32:45 2020 +0300

    io_uring: add splice(2) support
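
A userspace analogue of the "splice to a pipe, then splice to the socket"
structure mentioned above, as a sketch only: the in-kernel sendfile uses an
internal per-task pipe rather than a pipe(2) pair, and the helper names
sendfile_via_pipe and demo_sendfile_via_pipe are hypothetical. Error handling
is deliberately simplistic.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: move up to len bytes from in_fd to out_fd via a pipe. */
static ssize_t sendfile_via_pipe(int out_fd, int in_fd, size_t len)
{
	int p[2];
	ssize_t total = 0;

	if (pipe(p) != 0)
		return -1;

	while (len > 0) {
		/* Stage 1: file -> pipe (uses and advances in_fd's offset). */
		ssize_t n = splice(in_fd, NULL, p[1], NULL, len, 0);

		if (n < 0) {
			total = -1;
			break;
		}
		if (n == 0)	/* end of file */
			break;
		len -= (size_t)n;

		/* Stage 2: pipe -> destination; drain before refilling. */
		while (n > 0) {
			ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n, 0);

			if (m <= 0) {
				total = -1;
				goto out;
			}
			n -= m;
			total += m;
		}
	}
out:
	close(p[0]);
	close(p[1]);
	return total;
}

/* Demo: send a temporary file over an AF_UNIX socket pair; 0 on success. */
static int demo_sendfile_via_pipe(void)
{
	static const char msg[] = "spliced via a pipe";
	char got[sizeof msg] = {0};
	int sv[2], ret = -1;
	FILE *f = tmpfile();

	if (!f)
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
		fclose(f);
		return -1;
	}
	if (fwrite(msg, 1, sizeof msg, f) == sizeof msg && fflush(f) == 0 &&
	    lseek(fileno(f), 0, SEEK_SET) == 0 &&
	    sendfile_via_pipe(sv[0], fileno(f), sizeof msg) == (ssize_t)sizeof msg &&
	    read(sv[1], got, sizeof got) == (ssize_t)sizeof got &&
	    memcmp(got, msg, sizeof msg) == 0)
		ret = 0;

	close(sv[0]);
	close(sv[1]);
	fclose(f);
	return ret;
}
```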

Now maybe someone cares about a different path?  Splice from socket to
pipe to file?  Splice from socket to pipe to other socket?  Does
anyone do any of this?  One can, of course, recv() directly to an
mmapped file, but then you pay for page faults, so that's probably a bad
idea in most cases.  At least all of these cases don't have spliced
buffers that refer to a potentially read-only file.

But I'm a little concerned that zerocopy sends from files to network
are actually important.

--Andy

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Pedro Falcato 4 days, 19 hours ago

On Tue, Jun 02, 2026 at 08:51:03PM -0700, Andy Lutomirski wrote:
> On Tue, Jun 2, 2026 at 5:12 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Tue, 2 Jun 2026 at 15:54, Askar Safin <safinaskar@gmail.com> wrote:
> > >
> > > Pedro is talking here not about this vmsplice patch, but about
> > > my future hypothetical patch, which will remove splice-pagecache-to-pipe.
> >
> > That absolutely would be my suggested next step.
> >
> > Something like the attached - get rid of filemap_splice_read()
> > entirely, and just replace it with copy_splice_read().
> 
> Am I understanding correctly that this will completely break zerocopy
> sendfile?  sendfile is, internally, splice-to-a-secret-per-task-pipe
> and then splice to the socket.  How much do people care?  These days,
> a lot of high-bandwidth network senders are sending encrypted data,
> which is not zerocopy from pagecache.  But there are surely some users

You can do zerocopy from the page cache, even with TLS on top, by having
your (fancy) NIC do TLS offloading for you. See https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf.
Linux works similarly. Slide 26 is particularly interesting.
(No KTLS I assume is using simple sendmsg()'s from user memory, SW TLS
and NIC KTLS are both sendfile(), per the slides)

TL;DR I really do think it matters.

> 
> Now maybe someone cares about a different path?  Splice from socket to
> pipe to file?  Splice from socket to pipe to other socket?  Does
> anyone do any of this?  One can, of course, recv() directly to an
> mmapped file, but then you pay for page faults, so that's probably a bad
> idea in most cases.  At least all of these cases don't have spliced
> buffers that refer to a potentially read-only file.
> 
> 
> But I'm a little concerned that zerocopy sends from files to network
> are actually important.
> 
> --Andy

-- 
Pedro

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Jakub Kicinski 4 days, 13 hours ago

On Wed, 3 Jun 2026 12:43:54 +0100 Pedro Falcato wrote:
> > Am I understanding correctly that this will completely break zerocopy
> > sendfile?  sendfile is, internally, splice-to-a-secret-per-task-pipe
> > and then splice to the socket.  How much do people care?  These days,
> > a lot of high-bandwidth network senders are sending encrypted data,
> > which is not zerocopy from pagecache.  But there are surely some users
> 
> You can do zerocopy from the page cache, even with TLS on top, by having
> your (fancy) NIC do TLS offloading for you. See https://people.freebsd.org/~gallatin/talks/euro2019-ktls.pdf.
> Linux works similarly. Slide 26 is particularly interesting.
> (No KTLS I assume is using simple sendmsg()'s from user memory, SW TLS
> and NIC KTLS are both sendfile(), per the slides)

FTR this datapoint should come with the caveat that kTLS _offload_ does
not support TLS 1.3 today. So how much that configuration is used in
practice is unclear.

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 5 days, 3 hours ago

On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Am I understanding correctly that this will completely break zerocopy
> sendfile?

Very much, yes.

And it's worth making it very very clear that ABSOLUTELY NONE of the
recent big security bugs were in splice.

They were all in the networking and crypto code that just didn't deal
with shared data correctly.

So in that sense, it's a bit sad to discuss castrating splice.

But it's probably still the right thing to at least try.

I've seen very impressive benchmark numbers over the years, but
they've often smelled more like benchmarketing than actual real work.

There's also a real possibility that a lot of the sendfile / splice
advantage has little to do with zero-copy, and more to do with the
cost of mapping and maintaining buffers in user space.

If you are sending file data using plain reads and writes, it's not
just the "copy from user space to socket data structures".

There's also the cost of populating user space in the first place:
page faults for mmap made *that* historical copy avoidance basically a
fairy tale.

And not using mmap means that you have the cost of double caching in
the kernel _and_ user space etc.

So sendfile() as a concept (whether you use combinations of splice()
system calls or the sendfile system call itself) isn't necessarily
only about the zero-copy, it's really also about avoiding the user
space memory management.

But yes, there's a very real question of performance.

I just suspect we'll never get real answers without going the "let's
just see what happens" route...

                Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Stefan Metzmacher 2 days, 21 hours ago

Hi Linus,

>> Am I understanding correctly that this will completely break zerocopy
>> sendfile?
> 
> Very much, yes.
> 
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
> 
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
> 
> So in that sense, it's a bit sad to discuss castrating splice.
> 
> But it's probably still the right thing to at least try.
> 
> I've seen very impressive benchmark numbers over the years, but
> they've often smelled more like benchmarketing than actual real work.
> 
> There's also a real possibility that a lot of the sendfile / splice
> advantage has little to do with zero-copy, and more to do with the
> cost of mapping and maintaining buffers in user space.
> 
> If you are sending file data using plain reads and writes, it's not
> just the "copy from user space to socket data structures".
> 
> There's also the cost of populating user space in the first place:
> page faults for mmap made *that* historical copy avoidance basically a
> fairy tale.
> 
> And not using mmap means that you have the cost of double caching in
> the kernel _and_ user space etc.
> 
> So sendfile() as a concept (whether you use combinations of splice()
> system calls or the sendfile system call itself) isn't necessarily
> only about the zero-copy, it's really also about avoiding the user
> space memory management.

I don't think so. Ok, maybe for webservers just serving tiny
html files, that's true. But for me with Samba it's really the
copy_to/from_iter() that is the major factor.

We can use io_uring with IOSQE_ASYNC in order to offload
the memcpy cpu wasting to different cores, but it's still
wasting a lot of resources.

For the case of filesystem => socket, we can use
IORING_OP_SENDMSG_ZC and that at least removes the
copy_from_iter() in the sendmsg path, but the
IORING_OP_READV of buffers in the sizes up to 8MBytes
is wasting cpu in copy_to_iter().

For the case with smbdirect and RDMA offload over 2x200GBit/s links,
only ~33GBytes/s is reached with the copies, and the server cpu (even
when using multiple cores) is the limit. Without the memcpy waste
~46GByte/s is easily reached and the limit is just the network link.

Maybe another solution could be having a version of
copy_to/from_iter that uses async_memcpy(), but didn't
have the time to experiment with that yet. Maybe a new flag
to preadv2/pwritev2 could control that, so that the
application can decide what's better.

But without an alternative please don't kill splice.

A lot of people are frustrated because they bought hardware
that is able to handle a lot of throughput, but
e.g. with the default of smb over tcp they get no
higher than 3.5GByte/s on a 100GBit/s link that's able
to handle ~11GBytes/s. And io_uring and splice are
a key factor to fix that.

Thanks!
metze

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by David Laight 2 days, 19 hours ago

On Fri, 5 Jun 2026 11:43:45 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> Hi Linus,
> 
> >> Am I understanding correctly that this will completely break zerocopy
> >> sendfile?  
> > 
> > Very much, yes.
> > 
> > And it's worth making it very very clear that ABSOLUTELY NONE of the
> > recent big security bugs were in splice.
> > 
> > They were all in the networking and crypto code that just didn't deal
> > with shared data correctly.
> > 
> > So in that sense, it's a bit sad to discuss castrating splice.
> > 
> > But it's probably still the right thing to at least try.
> > 
> > I've seen very impressive benchmark numbers over the years, but
> > they've often smelled more like benchmarketing than actual real work.
> > 
> > There's also a real possibility that a lot of the sendfile / splice
> > advantage has little to do with zero-copy, and more to do with the
> > cost of mapping and maintaining buffers in user space.
> > 
> > If you are sending file data using plain reads and writes, it's not
> > just the "copy from user space to socket data structures".
> > 
> > There's also the cost of populating user space in the first place:
> > page faults for mmap made *that* historical copy avoidance basically a
> > fairy tale.
> > 
> > And not using mmap means that you have the cost of double caching in
> > the kernel _and_ user space etc.
> > 
> > So sendfile() as a concept (whether you use combinations of splice()
> > > system calls or the sendfile system call itself) isn't necessarily
> > only about the zero-copy, it's really also about avoiding the user
> > space memory management.  
> 
> I don't think so. Ok, maybe for webservers just serving tiny
> html files, that's true. But for me with Samba it's really the
> copy_to/from_iter() that is the major factor.

Is that copy also doing the ip checksum?
I really can't tell from the code (it does sometimes, even for tcp).
But I can't help feeling that optimisation is well past its sell by date.

-- David

> 
> We can use io_uring with IOSQE_ASYNC in order to offload
> the memcpy cpu wasting to different cores, but it's still
> wasting a lot of resources.
> 
> For the case of filesystem => socket, we can use
> IORING_OP_SENDMSG_ZC and that at least removes the
> copy_from_iter() in the sendmsg path, but the
> IORING_OP_READV of buffers in the sizes up to 8MBytes
> is wasting cpu in copy_to_iter().
> 
> For the case with smbdirect and RDMA offload over 2x200GBit/s links,
> only ~33GBytes/s is reached with the copies, and the server cpu (even
> when using multiple cores) is the limit. Without the memcpy waste
> ~46GByte/s is easily reached and the limit is just the network link.
> 
> Maybe another solution could be having a version of
> copy_to/from_iter that uses async_memcpy(), but didn't
> have the time to experiment with that yet. Maybe a new flag
> to preadv2/pwritev2 could control that, so that the
> application can decide what's better.
> 
> But without an alternative please don't kill splice.
> 
> A lot of people are frustrated because they bought hardware
> that is able to handle a lot of throughput, but
> e.g. with the default of smb over tcp they get no
> higher than 3.5GByte/s on a 100GBit/s link that's able
> to handle ~11GBytes/s. And io_uring and splice are
> a key factor to fix that.
> 
> Thanks!
> metze
>

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Stefan Metzmacher 2 days, 16 hours ago

Hi David,

>>> So sendfile() as a concept (whether you use combinations of splice()
>>> system calls or the sendfile system call itself) isn't necessarily
>>> only about the zero-copy, it's really also about avoiding the user
>>> space memory management.
>>
>> I don't think so. Ok, maybe for webservers just serving tiny
>> html files, that's true. But for me with Samba it's really the
>> copy_to/from_iter() that is the major factor.
> 
> Is that copy also doing the ip checksum?

Not in my tests. I guess there's offload in the network hardware
for this.

At least at the syscall layer of sendmsg() there's no checksumming
happening.

metze

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by David Laight 1 day, 21 hours ago

On Fri, 5 Jun 2026 17:20:34 +0200
Stefan Metzmacher <metze@samba.org> wrote:

> Hi David,
> 
> >>> So sendfile() as a concept (whether you use combinations of splice()
> >>> system calls or the sendfile system call itself) isn't necessarily
> >>> only about the zero-copy, it's really also about avoiding the user
> >>> space memory management.  
> >>
> >> I don't think so. Ok, maybe for webservers just serving tiny
> >> html files, that's true. But for me with Samba it's really the
> >> copy_to/from_iter() that is the major factor.  
> > 
> > Is that copy also doing the ip checksum?  
> 
> Not in my tests. I guess there's offload in the network hardware
> for this.

There will be, it is just whether the syscall checksum is actually
being suppressed.

-- David

> 
> At least at the syscall layer of sendmsg() there's no checksumming
> happening.
> 
> metze
>

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Jakub Kicinski 4 days, 13 hours ago

On Tue, 2 Jun 2026 21:20:13 -0700 Linus Torvalds wrote:
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
> 
> So in that sense, it's a bit sad to discuss castrating splice.

+1 IMVHO the networking bugs were people just not knowing what they
were doing. Presumably AI has scrounged all the occurrences of that
bug by now. I'd also hate to render splice optimizations moot based
on those bugs.

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 4 days, 13 hours ago

> On Jun 2, 2026, at 9:20 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> Am I understanding correctly that this will completely break zerocopy
>> sendfile?
>
> Very much, yes.
>
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
>
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
>
> So in that sense, it's a bit sad to discuss castrating splice.
>
> But it's probably still the right thing to at least try.
>
> I've seen very impressive benchmark numbers over the years, but
> they've often smelled more like benchmarketing than actual real work.
>
> There's also a real possibility that a lot of the sendfile / splice
> advantage has little to do with zero-copy, and more to do with the
> cost of mapping and maintaining buffers in user space.
>
> If you are sending file data using plain reads and writes, it's not
> just the "copy from user space to socket data structures".
>
> There's also the cost of populating user space in the first place:
> page faults for mmap made *that* historical copy avoidance basically a
> fairy tale.
>
> And not using mmap means that you have the cost of double caching in
> the kernel _and_ user space etc.
>
> So sendfile() as a concept (whether you use combinations of splice()
> system calls or the sendfile system call itself) isn't necessarily
> only about the zero-copy, it's really also about avoiding the user
> space memory management.

So maybe we should make sure that, if we go down the route of
disabling all the splice magic, that we leave an API, maybe the
existing sendfile or maybe something else, that does an optimized copy
from one fd to another and that is at least capable of sending from a
file to the network with at most one CPU-side copy.

Even if we’re just doing that, I continue to find it strange that we
require that a pipe be involved. What’s so special about pipes that we
allow splicing from file to pipe and then pipe to socket (thus
requiring that the pipe retain a reference to the file’s page cache
structures to avoid *two* copies), but we can’t splice straight from
file to socket. Heck, even sendfile is implemented under the hood as a
pair of splices!

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 13 hours ago

On Wed, 3 Jun 2026 at 11:10, Andy Lutomirski <luto@amacapital.net> wrote:
>
> So maybe we should make sure that, if we go down the route of
> disabling all the splice magic, that we leave an API, maybe the
> existing sendfile or maybe something else, that does an optimized copy
> from one fd to another and that is at least capable of sending from a
> file to the network with at most one CPU-side copy.

Why?

That is *LITERALLY* the attack surface - and the complexity - that we
should be removing.

sendfile() was a mistake. It is literally the "file->socket" thing
that has been buggy.

I absolutely refuse to get rid of splice code but keep the buggy sh*t
cases that caused all the problems in the first place.

Because *THAT* would just be completely insane and pointless.

> Even if we’re just doing that, I continue to find it strange that we
> require that a pipe be involved. What’s so special about pipes

Again: it was never splice or the pipe that was the problem. Stop
barking up the wrong tree.

It was "file data to socket" that was the truly horrendous issue.

That said, to explain the pipe: The reason for the pipe is to act as
the kernel-side buffer.

Now, these days we have much more capable iov_iter interfaces than we
used to, and in that sense the "pipe as a buffer" is certainly not the
obvious choice now.

But even then you need to have a *handle* to the buffers for the
general case, and that's what the pipe fd ends up then still
effectively being.

It was also done to avoid the M:N translation problem, because people
wanted to do zero-copy between other things than just "file ->
socket".

But again: we're ABSOLUTELY NOT keeping that "file -> socket" thing
and getting rid of splice.  That's literally keeping the bath-water
and throwing out the baby.

Splice is the *good* part (well, relatively - splice is bad too).

file->socket needs to DIE IN A FIRE considering the security problems it has had.

I hope Jakub is right that the problems have been all fixed, and this
is all theoretical, but having seen just *how* many there were, I'm a
bit sceptical.

Because if people think splice is complicated, you haven't looked at
the skb rules. They are completely arbitrary and complex and spread
all over the tree.

               Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 4 days, 10 hours ago

On Wed, Jun 3, 2026 at 11:29 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 3 Jun 2026 at 11:10, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > So maybe we should make sure that, if we go down the route of
> > disabling all the splice magic, that we leave an API, maybe the
> > existing sendfile or maybe something else, that does an optimized copy
> > from one fd to another and that is at least capable of sending from a
> > file to the network with at most one CPU-side copy.
>
> Why?
>
> That is *LITERALLY* the attack surface - and the complexity - that we
> should be removing.

I think I buried the lede too much and you're arguing against what I
was trying not to say.

Maybe we should keep an API that does an optimized copy, from one fd
to another, that can send from a file to the network with at most ONE
cpu-side copy.  Not aiming for zero like sendfile / splice.  Aiming
for one.

If sendfile and splice get completely deoptimized (which I think makes
a considerable amount of sense), then I think that, as you said,
there's a risk that the most efficient way to send the contents of a
file to the network is to read it into user memory and then send it,
which is *two* copies to get it from pagecache to the outgoing socket
buffer.  But I think that just one copy can be done with essentially
no funny business.
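To make the copy counting above concrete, here is a minimal user-space sketch in C of the two user-visible ways to push a file into a socket: read() into a user buffer followed by write() (the "two copies" path), and sendfile(), where no user buffer is involved. How many copies the kernel makes internally for sendfile() is exactly what this thread is debating, so the sketch only shows the API difference. The helper names and the AF_UNIX socketpair standing in for a network socket are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t w = write(fd, buf, len);

		if (w <= 0)
			return -1;
		buf += w;
		len -= (size_t)w;
	}
	return 0;
}

/* Read exactly len bytes, retrying on short reads. */
static int read_all(int fd, char *buf, size_t len)
{
	while (len > 0) {
		ssize_t r = read(fd, buf, len);

		if (r <= 0)
			return -1;
		buf += r;
		len -= (size_t)r;
	}
	return 0;
}

/* Two copies: page cache -> user buffer (read), user buffer -> socket (write). */
static ssize_t send_via_user_buffer(int file_fd, int sock_fd, size_t len)
{
	char buf[4096];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(file_fd, buf, want);

		if (n < 0)
			return -1;
		if (n == 0)
			break;	/* EOF before len bytes were sent */
		if (write_all(sock_fd, buf, (size_t)n) < 0)
			return -1;
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* No user-space buffer: the kernel moves the data from file to socket itself. */
static ssize_t send_via_sendfile(int file_fd, int sock_fd, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = sendfile(sock_fd, file_fd, NULL, len - done);

		if (n < 0)
			return -1;
		if (n == 0)
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Send one temp file both ways; return 0 if the receiver sees identical bytes. */
int demo_send_paths(void)
{
	static char data[10000], got[10000];
	int sv[2], fd;
	FILE *f = tmpfile();

	assert(f != NULL);
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	fd = fileno(f);
	memset(data, 'x', sizeof(data));
	if (write_all(fd, data, sizeof(data)) < 0)
		return -1;

	if (lseek(fd, 0, SEEK_SET) != 0 ||
	    send_via_user_buffer(fd, sv[0], sizeof(data)) != (ssize_t)sizeof(data) ||
	    read_all(sv[1], got, sizeof(got)) < 0 || memcmp(got, data, sizeof(data)))
		return -1;

	memset(got, 0, sizeof(got));
	if (lseek(fd, 0, SEEK_SET) != 0 ||
	    send_via_sendfile(fd, sv[0], sizeof(data)) != (ssize_t)sizeof(data) ||
	    read_all(sv[1], got, sizeof(got)) < 0 || memcmp(got, data, sizeof(data)))
		return -1;
	return 0;
}
```

Calling demo_send_paths() from a main() should return 0 on a typical Linux system. The receiver cannot tell the two paths apart, which is the property the proposal relies on.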

copy_splice_read is conceptually not terrible at all -- it allocates
memory and copies from page cache.  But splice_to_socket involves
MSG_SPLICE_PAGES, which I think is a part of the mess that you
dislike.  And the path where one does copy_splice_read and then
splice_to_socket has to be a bit complex because of tee and (I think)
because splice_to_socket cannot assume that the incoming data is just
ordinary unshared buffers.

What I'm suggesting is that, at least for network families/protocols
that care to support such a thing, there could be a slightly tedious
but otherwise utterly boring path to *copy* from pagecache to socket
buffers.  So, once the copy is done, the skbs would be ordinary skbs,
exactly as if the user had called plain send(), and nothing downstream
(the network drivers, crazy crypto code, etc) would ever see the
difference.

I don't think I'm suggesting keeping *splice* as the user-visible API,
but maybe plain sendfile could do this, and maybe someone would add
io_uring support, but all the complexity would be confined to the code
that does the actual copy and not spread to anywhere else in the
network stack.

--Andy

>
> sendfile() was a mistake. It is literally the "file->socket" thing
> that has been buggy.
>
> I absolutely refuse to get rid of splice code but keep the buggy sh*t
> cases that caused all the problems in the first place.
>
> Because *THAT* would just be completely insane and pointless.
>
> > Even if we’re just doing that, I continue to find it strange that we
> > require that a pipe be involved. What’s so special about pipes
>
> Again: it was never splice or the pipe that was the problem. Stop
> barking up the wrong tree.
>
> It was "file data to socket" that was the truly horrendous issue.
>
> That said, to explain the pipe: The reason for the pipe is to act as
> the kernel-side buffer.
>
> Now, these days we have much more capable iov_iter interfaces than we
> used to, and in that sense the "pipe as a buffer" is certainly not the
> obvious choice now.
>
> But even then you need to have a *handle* to the buffers for the
> general case, and that's what the pipe fd ends up then still
> effectively being.
>
> It was also done to avoid the M:N translation problem, because people
> wanted to do zero-copy between other things than just "file ->
> socket".
>
> But again: we're ABSOLUTELY NOT keeping that "file -> socket" thing
> and getting rid of splice.  That's literally keeping the bath-water
> and throwing out the baby.
>
> Splice is the *good* part (well, relatively - splice is bad too).
>
> file->socket needs to DIE IN A FIRE considering the security problems it has had.
>
> I hope Jakub is right that the problems have been all fixed, and this
> is all theoretical, but having seen just *how* many there were, I'm a
> bit sceptical.
>
> Because if people think splice is complicated, you haven't looked at
> the skb rules. They are completely arbitrary and complex and spread
> all over the tree.
>
>                Linus

--
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 4 days, 8 hours ago

Andy Lutomirski <luto@amacapital.net>:
> Maybe we should keep an API that does an optimized copy, from one fd
> to another, that can send from a file to the network with at most ONE
> cpu-side copy.  Not aiming for zero like sendfile / splice.  Aiming
> for one.

Yes, this is what my hypothetical future patch will do.

One copy from pagecache to pipe, and then network uses that buffer
directly.

> But splice_to_socket involves
> MSG_SPLICE_PAGES, which I think is a part of the mess that you
> dislike.  And the path where one does copy_splice_read and then
> splice_to_socket has to be a bit complex because of tee and (I think)
> because splice_to_socket cannot assume that the incoming data is just
> ordinary unshared buffers.

My future patch will provide a new guarantee: pipe buffers are always
stable, i.e. they will not be externally modified.

So hopefully network code will be adjusted to use this guarantee.

But pipe buffers will not be "ordinary unshared buffers".

They still may be shared with other things because of tee(2).
(But they are still stable! They will not be randomly modified!)

But network code can do "pipe_buf_try_steal" and thus ensure that
these buffers are not shared with anything else.

So, network code can be modified to use "pipe_buf_try_steal", and you
will get "ordinary unshared buffers" exactly as you want. This will
give you in total exactly one copy.

Also: as far as I understand, pipe_buf_try_steal was previously
kind of a lie. It could return true for buffers created via vmsplice with
GIFT. (I did not check this, but I think so.) I.e. pipe_buf_try_steal would
return "true" in this case, but the pages were still shared! But, thanks to my
vmsplice patchset (which is already applied), this is no longer true!
So now pipe_buf_try_steal is absolutely safe to use!

Finally, we can degrade tee(2) to a copy, and hopefully this will
allow us to always be sure that pipe buffers are not shared with anything.
This is a possible future direction.

-- 
Askar Safin

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 4 days, 8 hours ago

On Wed, Jun 3, 2026 at 3:43 PM Askar Safin <safinaskar@gmail.com> wrote:
>
> Andy Lutomirski <luto@amacapital.net>:
> > Maybe we should keep an API that does an optimized copy, from one fd
> > to another, that can send from a file to the network with at most ONE
> > cpu-side copy.  Not aiming for zero like sendfile / splice.  Aiming
> > for one.
>
> Yes, this is what my hypothetical future patch will do.
>
> One copy from pagecache to pipe, and then network uses that buffer
> directly.
>
> > But splice_to_socket involves
> > MSG_SPLICE_PAGES, which I think is a part of the mess that you
> > dislike.  And the path where one does copy_splice_read and then
> > splice_to_socket has to be a bit complex because of tee and (I think)
> > because splice_to_socket cannot assume that the incoming data is just
> > ordinary unshared buffers.
>
> My future patch will provide new guarantee: pipe buffers are always
> stable, i. e. they will not be externally-modified.
>
> So hopefully network code will be adjusted to use this guarantee.
>
> But pipe buffers will not be "ordinary unshared buffers".
>
> They still may be shared with other things because of tee(2).
> (But they are still stable! They will not be randomly modified!)
>
> But network code can do "pipe_buf_try_steal" and thus ensure that
> these buffers are not shared with anything else.
>
> So, network code can be modified to use "pipe_buf_try_steal", and you
> will get "ordinary unshared buffers" exactly as you want. This will
> give you in total exactly one copy.
>
> Also: as well as I understand, previously, pipe_buf_try_steal was
> kind of lie. It may return true for buffers created via vmsplice with
> GIFT. (I did not check this, but I think so.) I. e. pipe_buf_try_steal will
> return "true" in this case, but pages are still shared! But, thanks to my
> vmsplice patchset (which is already applied), this is no longer true!
> So now pipe_buf_try_steal is absolutely safe to use!
>
> Finally, we can degrade tee(2) to copy, and hopefully this will
> allow us to always be sure that pipe buffers are not shared with anything.
> This is possible future direction.

I'm a bit nervous that, if I've read the code correctly (a big if),
then iscsi and nvme will still send *shared* buffers via
MSG_SPLICE_PAGES, but that normal user code will not be able to do
this, and that something will bitrot.

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 4 days, 8 hours ago

Andy Lutomirski <luto@amacapital.net>:
> On Wed, Jun 3, 2026 at 3:43 PM Askar Safin <safinaskar@gmail.com> wrote:
> > Finally, we can degrade tee(2) to copy, and hopefully this will
> > allow us to always be sure that pipe buffers are not shared with anything.
> > This is possible future direction.
> 
> I'm a bit nervous that, if I've read the code correctly (a big if),
> then iscsi and nvme will still send *shared* buffers via
> MSG_SPLICE_PAGES, but that normal user code will not be able to do
> this, and that something will bitrot.

If I understand you correctly, you mean that if we remove
tee(2), there will still be subsystems that are able to
send shared pages.

Yes, I totally agree.

So, if we remove tee(2), then we will probably need to remove all
non-standard implementations of pipe_buf_operations.

-- 
Askar Safin

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 7 hours ago

On Wed, 3 Jun 2026 at 16:01, Askar Safin <safinaskar@gmail.com> wrote:
>
> So, if we remove tee(2), then we will probably need to remove all
> non-standard implementations of pipe_buf_operations.

I don't think tee matters.

Sure, it will share pages across pipes.

But if we make normal "splice to pipe" always copy from the page
cache, nobody cares.

You can corrupt the resulting pages as much as you want - through
multiple pipes if you use tee() to copy it - and it's all just
corrupting your private copy.

And yes, iSCSI and nvme might do their own splice-like thing, but
again, nobody really cares. When it's all kernel-internal, the attack
surface has gone away.

So that's why splice() (and vmsplice()) is special - not because it's
buggy, but because it's the user-facing attack surface to expose bugs
elsewhere.

             Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 9 hours ago

On Wed, 3 Jun 2026 at 14:31, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I think I buried the lede too much and you're arguing against what I
> was trying not to say.
>
> Maybe we should keep an API that does an optimized copy, from one fd
> to another, that can send from a file to the network with at most ONE
> cpu-side copy.  Not aiming for zero like sendfile / splice.  Aiming
> for one.

Oh, absolutely - that's what my completely untested test patch basically did.

The user space interface was still there.

And the networking side still continued to use the ->splice_write()
thing for writing to the socket.

It was just the filesystem side that basically now instead of exposing
the page cache directly (with filemap_splice_read) now only exposed a
*copy* of the page cache (with copy_splice_read).

                  Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 4 days, 9 hours ago

On Wed, Jun 3, 2026 at 2:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 3 Jun 2026 at 14:31, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > I think I buried the lede too much and you're arguing against what I
> > was trying not to say.
> >
> > Maybe we should keep an API that does an optimized copy, from one fd
> > to another, that can send from a file to the network with at most ONE
> > cpu-side copy.  Not aiming for zero like sendfile / splice.  Aiming
> > for one.
>
> Oh, absolutely - that's what my completely untested test patch  basically did.
>
> The user space interface was still there.
>
> And the networking side still continued to use the ->splice_write()
> thing for writing to the socket.

So I'm suspicious that you've possibly made bugs much (MUCH) harder to
exploit, but the underlying awful code and opportunity for bugs is
still there.  MSG_SPLICE_PAGES is still around, and there is still
(AFAICS) no actual coherent description of what it means.  There is
code that checks for it and apparently needs to do something special.
For example, some random kernel version I have checked out has this
delight in af_alg.c:

                /* use the existing memory in an allocated page */
                if (ctx->merge && !(msg->msg_flags & MSG_SPLICE_PAGES)) {

Grepping for MSG_SPLICE_PAGES comes up with all kinds of terrors.
Check out the lovely comment in drivers/block/drbd/drbd_main.c, for
example...

And even with your patch, I think checking for MSG_SPLICE_PAGES still
matters: if I write to a pipe (using copy_splice_read or even just
plain write) and then I tee() that data, then I splice one of those
teed copies into a socket, then we hit ->sendmsg with MSG_SPLICE_PAGES
set, and we're hoping that the code does the right thing.  And maybe
all the bugs are fixed by now or maybe they're not.  Most of what your
patch accomplishes is breaking the connection between the buffers and
pagecache, so you can't poison /sbin/su.

It also seems kind of unfortunate that we can have skbs that contain
data that isn't actually owned by the socket in question, and, with
your patch applied, I'm wondering if the only case where this can
really happen is tee() and a handful of random drivers that send to
sockets.  (The ones in drivers/nvme/host/tcp.c and iSCSI seem like the
ones that people are likely to care about the most.)

I *think* that what I'm sort of suggesting is to drop this ability
from the kernel as well, or at least to consider it.  skbs would
always own their contents.  And something would get wired up so that
at least the cases of sendfile, nvme and iscsi to TCP or UDP sockets
would still work with only one copy, from the source page cache into
the socket buffer.

I suppose the counterargument is that, even if more bugs exist, it's a
bit hard to imagine a real attack involving tee, and one needs
privileges to set up nvme or iscsi aimed at an unusual socket type.

-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 8 hours ago

On Wed, 3 Jun 2026 at 15:23, Andy Lutomirski <luto@amacapital.net> wrote:
>
> So I'm suspicious that you've possibly made bugs much (MUCH) harder to
> exploit, but the underlying awful code and opportunity for bugs is
> still there.  MSG_SPLICE_PAGES is still around, and there is still
> (AFAICS) no actual coherent description of what it means.

I don't disagree. I've only looked at the filesystem side.

The networking side does some odd stuff too (and I did look at some of
that, and had to be edumacated by Jakub on some of the subtler rules
for what skb data sharing is ok and when it's not - really not my
area).

But at least MSG_SPLICE_PAGES should be kernel-internal only
interface, and once you don't share page cache pages with networking
code I think that kneecaps a lot of the attacks.

So that's really the aim here for me - at least _attempting_ to go
"maybe we can just limit splice enough that it doesn't even *matter*
when networking does something odd and questionable".

And it's entirely possible that the current zero-copy "networking gets
direct access to the page cache folios" is a huge and insurmountable
performance requirement for some loads. So the vmsplice patch - and
_particularly_ my suggested "let's try always copying" patch - may
simply be doomed.

But I'd rather try to simplify the splice code by removing complexity
- and possibly then failing and having to revert it and rethink things
- than not even trying.

Because I think splice() is a *cool* feature. It was always *clever*.
I just don't think it's worth the pain it has caused.

And it's been around for a long long time, and after more than two
decades it's still most definitely not _widely_ used.

So that makes it a failure in my book. Sometimes "clever" just isn't
the right thing.

               Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Stefan Metzmacher 2 days, 16 hours ago

Hi Linus,

> On Wed, 3 Jun 2026 at 15:23, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> So I'm suspicious that you've possibly made bugs much (MUCH) harder to
>> exploit, but the underlying awful code and opportunity for bugs is
>> still there.  MSG_SPLICE_PAGES is still around, and there is still
>> (AFAICS) no actual coherent description of what it means.
> 
> I don't disagree. I've only looked at the filesystem side.
> 
> The networking side does some odd stuff too (and I did look at some of
> that, and had to be edumacated by Jakub on some of the subtler rules
> for what skb data sharing is ok and when it's not - really not my
> area).
> 
> But at least MSG_SPLICE_PAGES should be kernel-internal only
> interface, and once you don't share page cache pages with networking
> code I think that kneecaps a lot of the attacks.
> 
> So that's really the aim here for me - at least _attempting_ to go
> "maybe we can just limit splice enough that it doesn't even *matter*
> when networking does something odd and questionable".

While prototyping a smbdirect_splice_to_bvecs() in order to
use rdma_rw_ctx_init_bvec() I found things like pipe_buf_try_steal()
and dived a bit deeper into struct address_space and found things like:
mapping_mapped, mapping_tagged, mapping_deny_writable, mapping_allow_writable
and similar things.

With that I'm wondering if we could allow splicing of
pages only if nobody mmap'ed the file => mapping_mapped() returned 0
and the page is not tagged with any of PAGECACHE_TAG_{DIRTY,WRITEBACK,TOWRITE}
and once a page is spliced we tag the page in the i_pages xarray
with a PAGECACHE_TAG_SPLICED. In all other cases the page would be copied.

Then any call to do_mmap() or vfs_writev() at the high level
and at the lower levels most likely filemap_get_entry()/filemap_map_pages()
will remove the pages marked with PAGECACHE_TAG_SPLICED
and allocate new pages used for the pagecache of the related index.
It would be a bit similar to invalidate_inode_pages2_range() for direct io
writes. Maybe optimizing by clearing PAGECACHE_TAG_SPLICED if the refcount
of the page is 1.

This would also mean the content of spliced pages won't be changed
by future writes to the file, which removes the problem with unstable
pages and checksums.

It means the most common workload, e.g. a file only opened for
file serving (or simple opens in general) would still be able to
be optimized.

Does that sound useful and doable?

metze

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 2 days, 15 hours ago

On Fri, 5 Jun 2026 at 08:15, Stefan Metzmacher <metze@samba.org> wrote:
>
> It means the most common workload, e.g. a file only opened for
> file serving (or simple opens in general) would still be able to
> be optimized.

Nope. If your web server opens files with write access, I'd be
extremely surprised.

And if you don't have write access, and you're sending out data from
files you opened just for reading - the only sane case - you hit all
the existing problems with "I can certainly look up pages, but I damn
well shouldn't pass those pages to the networking code without copying
them".

               Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 9 hours ago

On Wed, 3 Jun 2026 at 14:36, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It was just the filesystem side that basically now instead of exposing
> the page cache directly (with filemap_splice_read) now only exposed a
> *copy* of the page cache (with copy_splice_read).

... and let me note that UNTESTED part again.

The patch looked "ObviouslyCorrect(tm)" to me, and I did actually
compile-test it too.

So it probably wasn't _complete_ crap.

But I never even booted it, and if I had, I wouldn't have had any
loads that uses splice (or sendfile) anyway.

So caveat emptor.

              Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 11 hours ago

On Wed, 3 Jun 2026 at 11:28, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But even then you need to have a *handle* to the buffers for the
> general case, and that's what the pipe fd ends up then still
> effectively being.

Again: for sendfile, you don't need the handle, because you can just
"read the file data again".

But the handle is needed for any buffering that can't do that -
iow pretty much *any* other case than a file-backed source.

So the original use-cases included things like copying media data from
a TV capture card to a GPU for outputting in a window.

There it's actually the intermediate buffer that is the important
thing, and it needs to have a lifetime that is independent of the
system call itself, because the system call may be interrupted by
signals etc, and you can't just "read the data again" when you
restart.

So the whole idea with splice() is that you have an input, an output,
and a stateful buffer between the two that has a lifetime.
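The "input, output, and a stateful buffer between the two" model can be sketched from user space in C. In this minimal example the pipe is the buffer: one splice() call fills it from the input, a second drains it to the output, and the data stays in the pipe between the two calls. The function names and the use of two temporary files as input and output are assumptions for illustration; whether the kernel copies or shares pages underneath is not visible at this level.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Move up to len bytes: input fd -> pipe (the buffer) -> output fd. */
ssize_t splice_through_pipe(int in_fd, int out_fd, size_t len)
{
	int p[2];
	size_t done = 0;
	int ok = 1;

	if (pipe(p) < 0)
		return -1;
	while (ok && done < len) {
		/* Stage 1: fill the pipe from the input (at most one pipe-full). */
		ssize_t in = splice(in_fd, NULL, p[1], NULL, len - done, 0);

		if (in < 0)
			ok = 0;
		if (in <= 0)
			break;
		/* Stage 2: drain the buffered data to the output. */
		for (ssize_t left = in; left > 0; ) {
			ssize_t out = splice(p[0], NULL, out_fd, NULL, (size_t)left, 0);

			if (out <= 0) {
				ok = 0;
				break;
			}
			left -= out;
		}
		done += (size_t)in;
	}
	close(p[0]);
	close(p[1]);
	return ok ? (ssize_t)done : -1;
}

/* Copy a temp file into another temp file via the pipe; 0 if contents match. */
int demo_splice_copy(void)
{
	static char data[200000], got[200000];
	FILE *fin = tmpfile(), *fout = tmpfile();

	assert(fin != NULL && fout != NULL);
	for (size_t i = 0; i < sizeof(data); i++)
		data[i] = (char)(i % 251);
	/* pwrite() leaves the file offset at 0, so splice() reads from the start. */
	if (pwrite(fileno(fin), data, sizeof(data), 0) != (ssize_t)sizeof(data))
		return -1;
	if (splice_through_pipe(fileno(fin), fileno(fout), sizeof(data)) !=
	    (ssize_t)sizeof(data))
		return -1;
	if (pread(fileno(fout), got, sizeof(got), 0) != (ssize_t)sizeof(got))
		return -1;
	return memcmp(got, data, sizeof(data)) ? -1 : 0;
}
```

Because the pipe is a real file descriptor, the buffered data survives an interrupted second stage: a caller could retry only the drain step without re-reading the input, which is the lifetime property described above.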

Having just an iov_iter isn't enough - even with the current much more
capable iov_iter we have now (compared to when splice came to be: two
decades ago when the modern iov_iter didn't even exist). You have to
have that notion of a buffer with a lifetime.

(iov_iter came a couple of years later, but it then took many many
years for it to become the powerful thing it is today where you can
put almost arbitrary data into it - it started as purely a user space
iovec iterator, all the bvec/kvec etc stuff that you need for IO
buffering came a decade later)

So there's historical reasons for the use of pipes, but there really
is a very fundamental reason for it too: wanting *generic* data
transfer between two points, not sendfile.

It's worth noticing that in the generic case, zero-copy isn't really
even an issue.

When you think operations like "splice TV capture input to a pipe",
you typically need to allocate the pages that you then DMA into
*anyway*, and you'd just put those pages into the pipe. And the fact
that you can then just take the data directly from those pages when
you splice from the pipe to whatever GPU engine that does the decoding
is kind of secondary.

So again: the big deal with splice() and the pipe isn't really about
zero-copy. It's the in-kernel buffers where the drivers control the
allocation and you don't have some "user space allocates memory, then
kernel looks that allocation up and uses it" model.

Having fewer copies is kind of incidental. It *might* happen just
because it's natural when some streaming device just gives its data
away and doesn't care after the fact.

The problem with splicing from a file has been exactly the fact that
it's *not* streaming data, and the filesystem zero-copy case gave
direct access to the long-term cache.

Which is undoubtedly good for performance. But it fundamentally
*requires* that the sink is trustworthy. Which has been problematic.

That's why sendfile() is bad. Not because splice itself is a bad
concept, but because you have to have that absolute trust across
components.

          Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by David Howells 4 days, 12 hours ago

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Because if people think splice is complicated, you haven't looked at
> the skb rules. They are completely arbitrary and complex and spread
> all over the tree.

Yeah - I fell foul of the net loopback driver just reflecting the outgoing
packet back, complete with all the original spliced bufferage.  I was
wondering if the loopback driver needs to look at the skbuff, see if it has
zerocopy elements of some sort and, if so, copy it (or drop it if ENOMEM).

David

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Christian Brauner 5 days ago

On Tue, Jun 02, 2026 at 09:20:13PM -0700, Linus Torvalds wrote:
> On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > Am I understanding correctly that this will completely break zerocopy
> > sendfile?
> 
> Very much, yes.
> 
> And it's worth making it very very clear that ABSOLUTELY NONE of the
> recent big security bugs were in splice.
> 
> They were all in the networking and crypto code that just didn't deal
> with shared data correctly.
> 
> So in that sense, it's a bit sad to discuss castrating splice.

Well, we're completely ignoring the fact that splice()'s locking and
interactions with pipe_lock() are complete insanity. So unless someone
sits down and really thinks about how to rework the locking I think
degrading splice() is just fine.

> But it's probably still the right thing to at least try.

Yes.

> I just suspect we'll never get real answers without going the "let's
> just see what happens" route...

Yes.

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Christian Brauner 4 days, 17 hours ago

On Wed, Jun 03, 2026 at 08:45:18AM +0200, Christian Brauner wrote:
> On Tue, Jun 02, 2026 at 09:20:13PM -0700, Linus Torvalds wrote:
> > On Tue, 2 Jun 2026 at 20:51, Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > Am I understanding correctly that this will completely break zerocopy
> > > sendfile?
> > 
> > Very much, yes.
> > 
> > And it's worth making it very very clear that ABSOLUTELY NONE of the
> > recent big security bugs were in splice.
> > 
> > They were all in the networking and crypto code that just didn't deal
> > with shared data correctly.
> > 
> > So in that sense, it's a bit sad to discuss castrating splice.
> 
> Well, we're completely ignoring the fact that splice()'s locking and
> interactions with pipe_lock() are complete insanity. So unless someone
> sits down and really thinks about how to rework the locking I think
> degrading splice() is just fine.
> 
> > But it's probably still the right thing to at least try.
> 
> Yes.
> 
> > I just suspect we'll never get real answers without going the "let's
> > just see what happens" route...
> 
> Yes.

Reading this thread again I'm really amazed how willingly people argue
to remain locked into a really broken API even if they're given a risky
but worthwhile chance to kill it for good. Anyway, odd-userspace behavior
time:

David reported vmsplice01 failing in the LTP testsuite after the change:

11297 20:41:02.548383  <LAVA_SIGNAL_STARTTC vmsplice01>
11298 20:41:02.548518  tst_tmpdir.c:316: TINFO: Using /tmp/LTP_vmsZ13ZQj as tmpdir (tmpfs filesystem)
11299 20:41:02.548656  tst_test.c:2047: TINFO: LTP version: 20260130
11300 20:41:02.548793  tst_test.c:2050: TINFO: Tested kernel: 7.1.0-rc6-next-20260602 #1 SMP PREEMPT Tue Jun  2 18:13:29 UTC 2026 aarch64
11301 20:41:02.548932  tst_kconfig.c:88: TINFO: Parsing kernel config '/proc/config.gz'
11302 20:41:02.549069  tst_test.c:1875: TINFO: Overall timeout per run is 0h 01m 30s
11303 20:41:02.549205  tst_test.c:1632: TINFO: tmpfs is supported by the test
11304 20:41:02.549340  Test timeouted, sending SIGKILL!
11305 20:41:02.549477  tst_test.c:1947: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
11306 20:41:02.549614  tst_test.c:1949: TBROK: Test killed! (timeout?)
11307 20:41:02.549751  
11308 20:41:02.549887  Summary:
11309 20:41:02.550021  passed   0
11310 20:41:02.550155  failed   0
11311 20:41:02.550290  broken   1
11312 20:41:02.550450  skipped  0
11313 20:41:02.550582  warnings 0
11314 20:41:02.550710  
11315 20:41:02.550838  <LAVA_SIGNAL_ENDTC vmsplice01>

So I looked at the test:

	while (v.iov_len) {
		/*
		 * in a real app you'd be more clever with poll of course,
		 * here we are basically just blocking on output room and
		 * not using the free time for anything interesting.
		 */
		if (poll(&pfd, 1, -1) < 0)
			tst_brk(TBROK | TERRNO, "poll() failed");

		written = vmsplice(pipes[1], &v, 1, 0);
		if (written < 0) {
			tst_brk(TBROK | TERRNO, "vmsplice() failed");
		} else {
			if (written == 0) {
				break;
			} else {
				v.iov_base += written;
				v.iov_len -= written;
			}
		}

		SAFE_SPLICE(pipes[0], NULL, fd_out, &offset, written, 0);
		//printf("offset = %lld\n", (long long)offset);
	}

Prior to the change add_to_pipe() returns -EAGAIN the moment the pipe is
full. So iter_to_pipe stops and returns a partial count capped at pipe
capacity. For a 128K buffer over a 64K pipe the first call returns 64K,
the test drains it, call 2 returns the remaining 64K. Done.

After this change do_writev(... flags & SPLICE_F_NONBLOCK ? RWF_NOWAIT :
0) then calls pipe_write which does not stop when the pipe fills. It
blocks until the entire iovec is consumed.

I kinda think we need to preserve similar semantics.
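The "partial count capped at pipe capacity" behaviour that the test depends on can be observed from user space with an ordinary non-blocking write(), used here only as a stand-in for the old vmsplice() path. This is a minimal sketch in C; the function name is hypothetical, and the capacity is queried with F_GETPIPE_SZ rather than assumed to be 64K.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Offer twice the pipe's capacity in a single non-blocking write() to an
 * empty pipe. Returns the number of bytes the pipe accepted (or -1 on
 * error) and stores the pipe capacity in *capacity.
 */
long offer_twice_capacity(long *capacity)
{
	int p[2];
	long cap, n = -1;
	char *buf = NULL;

	assert(capacity != NULL);
	if (pipe(p) < 0)
		return -1;
	cap = fcntl(p[1], F_GETPIPE_SZ);
	if (cap > 0 && fcntl(p[1], F_SETFL, O_NONBLOCK) == 0)
		buf = calloc(2, (size_t)cap);
	if (buf) {
		/* The pipe takes what fits and returns a short count. */
		n = (long)write(p[1], buf, 2 * (size_t)cap);
		*capacity = cap;
		free(buf);
	}
	close(p[0]);
	close(p[1]);
	return n;
}
```

The returned count is positive but never exceeds the capacity, so a caller that drains the pipe between calls makes progress without ever blocking on a full pipe. A blocking write of the same buffer would instead wait for a reader, which matches the hang described above.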

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 16 hours ago

On Wed, 3 Jun 2026 at 06:40, Christian Brauner <brauner@kernel.org> wrote:
>
> Prior to the change add_to_pipe() returns -EAGAIN the moment the pipe is
> full. So iter_to_pipe stops and returns a partial count capped at pipe
> capacity. For a 128K buffer over a 64K pipe the first call returns 64K,
> the test drains it, call 2 returns the remaining 64K. Done.
>
> After this change do_writev(... flags & SPLICE_F_NONBLOCK ? RWF_NOWAIT :
> 0) then calls pipe_write which does not stop when the pipe fills. It
> blocks until the entire iovec is consumed.
>
> I kinda think we need to preserve similar semantics.

Ack. We definitely do need to keep the old semantics.

Looking at the patch again, I think it's that

    (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0

thing that is broken. I think splice_to_pipe is *always* nowait - but
has the special conditional _initial_ wait.

So I think the RWF_NOWAIT should be unconditional to the do_writev(),
and instead the code should do something like

        ret = wait_for_space(pipe, flags);
        if (!ret) do_writev(...RWF_NOWAIT);

but admittedly I did not think very much about the details, so I might
miss something.
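The "conditional initial wait, then always-nowait write" shape can be sketched from user space in C. This is only an analogue: wait_for_space() and do_writev() are kernel-internal, so poll() for POLLOUT stands in for the initial wait and a write() on an O_NONBLOCK pipe stands in for the RWF_NOWAIT write. All function names below are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Optionally wait for room once, then write without ever blocking.
 * fd must be the O_NONBLOCK write end of a pipe.
 */
ssize_t write_after_wait(int fd, const char *buf, size_t len, int nonblock)
{
	if (!nonblock) {
		struct pollfd pfd = { .fd = fd, .events = POLLOUT };

		if (poll(&pfd, 1, -1) < 0)
			return -1;
	}
	/* Takes whatever fits; fails with EAGAIN if nothing fits. */
	return write(fd, buf, len);
}

/*
 * Push four pipe-fulls through a pipe, draining after each call as the
 * vmsplice01 loop does. Returns the number of write calls, or -1.
 */
int demo_push_through_pipe(void)
{
	int p[2], calls = 0;
	long cap;
	size_t total, sent = 0;
	char *src, *sink;

	if (pipe(p) < 0)
		return -1;
	cap = fcntl(p[1], F_GETPIPE_SZ);
	if (cap <= 0 || fcntl(p[1], F_SETFL, O_NONBLOCK) < 0)
		return -1;
	total = 4 * (size_t)cap;
	src = calloc(1, total);
	sink = malloc((size_t)cap);
	assert(src != NULL && sink != NULL);

	while (sent < total) {
		ssize_t n = write_after_wait(p[1], src + sent, total - sent, 0);

		if (n <= 0)
			return -1;
		calls++;
		sent += (size_t)n;
		/* Drain what was just written so the next call finds room. */
		for (ssize_t left = n; left > 0; ) {
			size_t want = left < cap ? (size_t)left : (size_t)cap;
			ssize_t r = read(p[0], sink, want);

			if (r <= 0)
				return -1;
			left -= r;
		}
	}
	free(src);
	free(sink);
	close(p[0]);
	close(p[1]);
	return calls;
}
```

Each call accepts at most one pipe-full, so moving four pipe-fulls takes at least four calls and the single-threaded loop never deadlocks. That is the behaviour the old vmsplice() gave the LTP test.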

Which also then probably means that we should just keep the legacy
wrapper in fs/splice.c and we'd just need to make do_writev() and
do_readv() non-static.

Because I'd rather keep wait_for_space() internal to splice (or
alternatively we'd move it to pipe.c, rename it to
"pipe_wait_for_space()", and change the 'flags' argument to be a
boolean to not make it use that splice-specific flags etc).

            Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 5 days, 6 hours ago

Linus Torvalds <torvalds@linux-foundation.org>:
> That absolutely would be my suggested next step.
> 
> Something like the attached - get rid of filemap_splice_read()
> entirely, and just replace it with copy_splice_read().

Okay, I will post something like this soon.

But I'm slow person, and also I will test things in Qemu, so this will
take some days.

-- 
Askar Safin

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Pedro Falcato 5 days, 8 hours ago

On Tue, Jun 02, 2026 at 03:06:07PM -0700, Linus Torvalds wrote:
> On Tue, 2 Jun 2026 at 14:37, Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > Well, that's most definitely part of my patch. Also, you cannot outright
> > remove splice() functionality
> 
> That isn't what Askar's patch ever did.
> 
> You apparently didn't even read it.
 
Well, I was replying to Askar's new idea to remove pagecache-to-pipe splice,
which is what he suggested. And directly intersects with my sysctl-to-disable-splice
patch.

> Honestly, I think you are the one out of line here.
> 
> Askar did something I suggested years ago, and didn't remove any functionality.
> 
> It just changes vmsplice to be a copying model (one of the directions
> already was). It doesn't change regular splice at all.
> 
> And yes, it has the potential to be a visible behavior difference - if
> some insane user uses vmsplice and then modifies the buffer
> *afterwards*, then that would be semantically different between a
> zero-copy and a normal copy.
> 
> But that would be insane behavior, and was never really reliable
> anyway even with zero-copy (ie subsequent writes to user space buffers
> would potentially do COW breaking based purely on timing and memory
> pressure etc, so anybody who relied on it being visible wasn't going
> to get it reliably anyway)
> 
> Perhaps more importantly, it has the potential to change performance -
> zero-copy *can* be a performance win, although typically it really
> doesn't tend to be (looking up the page mapping is often slower than
> copying).
> 
> I would expect it to be very clear in trivial benchmarks that aren't
> actually real loads. And probably not visible anywhere else.

Yes, vmsplice() sucks, and we know it. Hopefully no one else will see the
difference. I don't think we can say the same for splice(), though.
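
The copy-versus-zero-copy difference quoted above can be illustrated
with a plain writev(), which always copies. A minimal sketch, assuming a
hypothetical helper reader_sees_original(): the buffer is modified after
the write, and the reader still sees the old bytes. With a zero-copy
vmsplice() the outcome of the same sequence is not something to rely on,
which is the point being made in the quote.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Returns 1 if the reader sees the bytes as they were at write time,
 * 0 if it sees the later modification, -1 on error.
 */
static int reader_sees_original(void)
{
    int p[2];
    char buf[4] = { 'A', 'A', 'A', 'A' };
    char out[4];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };

    if (pipe(p) != 0)
        return -1;

    ssize_t n = writev(p[1], &iov, 1);   /* copying model: data is snapshotted here */
    memset(buf, 'B', sizeof buf);        /* modify the buffer *afterwards* */
    ssize_t r = read(p[0], out, sizeof out);

    close(p[0]);
    close(p[1]);
    if (n != 4 || r != 4)
        return -1;
    return memcmp(out, "AAAA", 4) == 0;
}
```

Under the copying model this returns 1 every time, independent of
timing or memory pressure.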

> Trying to make it look like Askar is the problem is only making you look worse.

To be clear, I don't think Askar is the (or a) problem. I'm glad he's
contributing, and getting rid of bad kernel interfaces is always nice. I was
just a little frustrated with a parallel splice-related-unscrew patch.

(Askar, if I was too hostile, I do sincerely apologize.)

-- 
Pedro

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 5 days, 8 hours ago

Pedro Falcato <pfalcato@suse.de>:
> (Askar, if I was too hostile, I do sincerely apologize.)

You did nothing wrong.

-- 
Askar Safin

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 1 week ago

On Sun, May 31, 2026 at 11:54 AM Pedro Falcato <pfalcato@suse.de> wrote:
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.

Yes. But I propose an alternative solution to this problem.

Brauner said in discussion for your patchset:
"So I'm not very likely to pick this up as is".
So, I decided to submit another solution.

Pedro, I'm not trying to insult you.

Other kernel developers will decide which of these two solutions they like more.

Many people in discussion of your patchset said how they
dislike splice/vmsplice, and especially vmsplice.
Hellwig said "vmsplice is the worst".
Brauner, Hellwig, Horn said that they dislike vmsplice.
They said that vmsplice in its current form should not
be used, and that it is broken.

Despite all these problems nobody managed to fix
vmsplice in all these years.
So I propose just to effectively remove it.

You may think that I just saw a recent discussion and decided
to jump in. No. splice/vmsplice has been a topic of interest of mine for many
years. You can verify this by searching "f:Askar splice"
on lore.kernel.org. I simply decided that, given the
recent vulnerabilities, now is the perfect time to solve
all these vmsplice problems once and for all.

I explained my position here:
https://lore.kernel.org/all/20260523204100.553125-1-safinaskar@gmail.com/ .
Nobody answered, so I just posted this patchset.

If my patchset is applied, then I will try to deal
with splice-pagecache-to-pipe somehow,
probably by removing it, too. :) I decided first
to deal with vmsplice, because it seems to be
an easier problem.

> > vmsplice(fd, vec, vlen, vmsplice_flags) will
> > be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
> > readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
> > writable pipe.
>
> This does not work. https://codesearch.debian.net/search?q=vmsplice%28&literal=1
> There are users.

Yes, there are. But my solution is compatible. vmsplice is simply a performance
optimization. vmsplice will work just as before, but slower.
And, most importantly, vmsplice design problems will be gone
(nobody managed to fix them anyway for all these years).

-- 
Askar Safin

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Christian Brauner 6 days, 15 hours ago

On Mon, Jun 01, 2026 at 12:21:06AM +0300, Askar Safin wrote:
> On Sun, May 31, 2026 at 11:54 AM Pedro Falcato <pfalcato@suse.de> wrote:
> > So, you took an ongoing discussion with an ongoing RFC patchset, and you
> > decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Yes. But I propose an alternative solution to this problem.

So I think this is a case where no explicit rules have been broken. But
if you know that someone has been posting patches and is working on a
problem just racing them to get your own stuff merged is very likely to
unnecessarily ruffle feathers. So sync with the person next time.

The discussion wasn't at an impasse and Pedro is expected to follow-up.
It's not very nice to just have someone else's work be for naught.

> Brauner said in discussion for your patchset:
> "So I'm not very likely to pick this up as is".
> So, I decided to submit another solution.

This lacks quite some context... I said "in its current form" and then a
long discussion ensued.

> If my patchset is applied, then I will try to deal
> with splice-pagecache-to-pipe somehow,
> probably by removing it, too. :) I decided first

So ok, but this is literally what Pedro is working on. This just wastes
people's time.

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by David Hildenbrand (Arm) 1 week ago

On 5/31/26 10:54, Pedro Falcato wrote:
> On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
>> This patchset is for VFS.
>>
>> Recently we got a lot of vulnerabilities in splice/vmsplice.
>>
>> Also vmsplice already was source of vulnerabilities in the past:
>> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
>>
>> Also vmsplice is problematic for other reasons. Here is what other
>> developers say:
>>
>> Linus Torvalds in 2023:
>>> So I'd personally be perfectly ok with just making vmsplice() be
>>> exactly the same as write, and turn all of vmsplice() into just "it's
>>> a read() if the pipe is open for read, and a write if it's open for
>>> writing".
>> https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
>>
>> Christoph Hellwig in May 2026:
>>> vmsplice is the worst, as it is one of the few remaining places that
>>> can incorrectly dirty file backed pages without telling the file system
>>> and cause the other problems fixed by a FOLL_PIN conversion, but it is
>>> the only one where we do not have any idea yet how we could convert it
>>> to FOLL_PIN due to the unbounded pin time.
>> https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/
>>
>> See recent discussion here:
>> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> 
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Riiiiiight.... I don't think I have to NAK this, do I?

Jup. I'll just ignore this patch set here.

-- 
Cheers,

David

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Christian Brauner 6 days, 15 hours ago

On Sun, 31 May 2026 01:01:04 +0000, Askar Safin wrote:
> This patchset is for VFS.
> 
> Recently we got a lot of vulnerabilities in splice/vmsplice.
> 
> Also vmsplice already was source of vulnerabilities in the past:
> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
> 
> [...]

Applied to the vfs-7.2.vmsplice branch of the vfs/vfs.git tree.
Patches in the vfs-7.2.vmsplice branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-7.2.vmsplice

[1/3] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
      https://git.kernel.org/vfs/vfs/c/a9f7db50ed2f
[2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
      https://git.kernel.org/vfs/vfs/c/e2c0b2368081
[3/3] splice: remove PIPE_BUF_FLAG_GIFT
      https://git.kernel.org/vfs/vfs/c/7d75aa8edfce

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 6 days, 14 hours ago

On Mon, 1 Jun 2026 at 09:42, Christian Brauner <brauner@kernel.org> wrote:
>
> Applied to the vfs-7.2.vmsplice branch of the vfs/vfs.git tree.

Btw, if people want to work further on this - assuming we don't get
any huge screams of pain from having effectively gotten rid of
vmsplice() - I don't think it would hurt to look at limiting the
"regular" splice() too.

We already have the code to just turn it into a pure copy on the
"splice to pipe" case: copy_splice_read(). In many ways it would be
*lovely* to just always force that path.

We already do that explicitly for DAX and O_DIRECT, but we made a lot
of special files do it implicitly too, so quite a lot of the splice
reading cases already use that "just read() into a kernel space
buffer" model for splicing.

It would be interesting to hear who would even notice if we just
always used that copy case, and made "f_op->splice_read" never trigger
at all.

And it turns out that the only thing that ever uses
"f_op->splice_write" is splice_to_socket. Which was actually the
problematic buggy case.

Everybody else pretty much seems to just use iter_file_splice_write(),
which does the "emulate it with just a write from kernel buffers".

So *if* we get rid of f_op->splice_read, we do leave the case that
really caused problems, but nobody will ever care. Because once splice
only deals with private buffers that can't be shared with anything
else, a f_op->splice_write() that gets things wrong is pretty much a
non-event.

(We'd have to look at 'tee()' too: I don't think anybody really uses
it, but it does do the "no copy linking" by just incrementing
refcounts on the pipe buffers. So to really protect against
splice_write users messing up, that should do copies too, but as long
as it's all "private ephemeral buffers" that get their refcounts
updated, I don't think anybody *really* cares)

TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
a big simplification.

                Linus
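
As a rough userspace picture of the "pure copy" model for splice-to-pipe
described above: read into a private buffer, then write that buffer into
the pipe, so the pipe never holds a reference to the source's pages.
copy_into_pipe() and its 4096-byte buffer are hypothetical choices for
illustration; the in-kernel copy_splice_read() works on kernel pages and
is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy up to len bytes (capped at 4096 in this sketch) from in_fd into
 * the write end of a pipe via a private buffer. Returns the number of
 * bytes moved, 0 on EOF, or -1 on error.
 */
static ssize_t copy_into_pipe(int in_fd, int pipe_wr, size_t len)
{
    char buf[4096];
    size_t off = 0;
    ssize_t n;

    assert(in_fd >= 0 && pipe_wr >= 0);

    if (len > sizeof buf)
        len = sizeof buf;

    /* Step 1: an ordinary read() into a buffer nobody else can see. */
    do {
        n = read(in_fd, buf, len);
    } while (n < 0 && errno == EINTR);
    if (n <= 0)
        return n;

    /* Step 2: hand the pipe its own copy of the data. */
    while (off < (size_t)n) {
        ssize_t w = write(pipe_wr, buf + off, (size_t)n - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += (size_t)w;
    }
    return n;
}
```

Because the pipe only ever sees the private copy, a consumer that
mishandles the pipe buffers cannot touch the original data.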

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Al Viro 6 days, 14 hours ago

On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:

> TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> a big simplification.

FUSE might be interesting - fuse_dev_splice_read() and its ilk.
Communications between the kernel and fuse server at least used to
seriously want that, so that would be one place to look for unhappy
userland...

splice-related logics in fs/fuse/dev.c is interesting; another place
like this is kernel/trace/, but I'm less familiar with that one.

rostedt Cc'd (miklos already had been)

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Miklos Szeredi 4 days, 21 hours ago

On Mon, 1 Jun 2026 at 19:33, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
>
> > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > a big simplification.
>
> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> Communications between the kernel and fuse server at least used to
> seriously want that, so that would be one place to look for unhappy
> userland...
>
> splice-related logics in fs/fuse/dev.c is interesting; another place
> like this is kernel/trace/, but I'm less familiar with that one.

[Cc: Joanne, fuse-devel]

I'd favor simplification, but care is needed to not regress performance.

Joanne might be in a better position to say something about relative
performance of various transport modes in fuse.

Thanks,
Miklos

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Steven Rostedt 6 days, 11 hours ago

On Mon, 1 Jun 2026 18:33:25 +0100
Al Viro <viro@zeniv.linux.org.uk> wrote:

> On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> 
> > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > a big simplification.  
> 
> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> Communications between the kernel and fuse server at least used to
> seriously want that, so that would be one place to look for unhappy
> userland...
> 
> splice-related logics in fs/fuse/dev.c is interesting; another place
> like this is kernel/trace/, but I'm less familiar with that one.
> 
> rostedt Cc'd (miklos already had been)

Thanks for the Cc. The tracing ring buffer was specifically made to be used
by splice and the libtracefs has a lot of code to use it as well. As
reading the ring buffer literally swaps out the write portion with a blank
read portion, that portion (sub-buffer) is used to be directly fed into
splice, providing a zero-copy of the trace data from the write of the event
to going into a file.

trace-cmd defaults to using splice to copy the tracing ring buffer directly
into files to avoid as much copying during live recordings as possible.

Whatever changes we make, I would like to make sure there's no regressions
in performance of trace-cmd record.

Thanks,

-- Steve
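
For context, the general shape of a splice-based recorder looks roughly
like the sketch below: data goes from a source fd into a pipe and from
the pipe into the output fd, without passing through a userspace buffer.
splice_through_pipe() is a hypothetical helper written for this note; it
is not taken from trace-cmd or libtracefs, whose actual code is more
involved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move everything from in_fd to out_fd through an intermediate pipe,
 * chunk bytes at a time. Returns the total moved, or -1 on error.
 */
static ssize_t splice_through_pipe(int in_fd, int out_fd, size_t chunk)
{
    int p[2];
    ssize_t total = 0;

    assert(chunk > 0);

    if (pipe(p) != 0)
        return -1;

    for (;;) {
        /* Source -> pipe: returns 0 at end of input. */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, chunk, SPLICE_F_MOVE);
        if (n < 0) {
            total = -1;
            break;
        }
        if (n == 0)
            break;

        /* Pipe -> destination: drain everything that was just queued. */
        while (n > 0) {
            ssize_t w = splice(p[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (w <= 0) {
                total = -1;
                goto out;
            }
            n -= w;
            total += w;
        }
    }
out:
    close(p[0]);
    close(p[1]);
    return total;
}
```

Whether each hop is a page move or a copy is decided inside the kernel,
which is why a change there can alter performance without any change to
code like this.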

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andrew Morton 6 days, 7 hours ago

On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 1 Jun 2026 18:33:25 +0100
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > 
> > > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > > a big simplification.  
> > 
> > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > Communications between the kernel and fuse server at least used to
> > seriously want that, so that would be one place to look for unhappy
> > userland...
> > 
> > splice-related logics in fs/fuse/dev.c is interesting; another place
> > like this is kernel/trace/, but I'm less familiar with that one.
> > 
> > rostedt Cc'd (miklos already had been)
> 
> Thanks for the Cc. The tracing ring buffer was specifically made to be used
> by splice and the libtracefs has a lot of code to use it as well. As
> reading the ring buffer literally swaps out the write portion with a blank
> read portion, that portion (sub-buffer) is used to be directly fed into
> splice, providing a zero-copy of the trace data from the write of the event
> to going into a file.
> 
> trace-cmd defaults to using splice to copy the tracing ring buffer directly
> into files to avoid as much copying during live recordings as possible.
> 
> Whatever changes we make, I would like to make sure there's no regressions
> in performance of trace-cmd record.

Well yes, the patchset seems sensible from a quality POV.  But to make
a decision we should first have a decent understanding of its downside
impact.

I haven't seen a description of that impact in the discussion thus far.
And that description is owed, please.

I assume a small number of specialized applications are using
vmsplice() to great effect?  What are those applications?  What is the
impact of this change?

Once we are armed with that information, is there some middle ground in
which we de-feature vmsplice()?  Fall back to pread/pwrite in the
tricky cases and still permit vmsplicing if the application is
appropriately restrictive in its usage?

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Willy Tarreau 4 days, 1 hour ago

On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Mon, 1 Jun 2026 18:33:25 +0100
> > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > 
> > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > 
> > > > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > > > a big simplification.  
> > > 
> > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > Communications between the kernel and fuse server at least used to
> > > seriously want that, so that would be one place to look for unhappy
> > > userland...
> > > 
> > > splice-related logics in fs/fuse/dev.c is interesting; another place
> > > like this is kernel/trace/, but I'm less familiar with that one.
> > > 
> > > rostedt Cc'd (miklos already had been)
> > 
> > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > by splice and the libtracefs has a lot of code to use it as well. As
> > reading the ring buffer literally swaps out the write portion with a blank
> > read portion, that portion (sub-buffer) is used to be directly fed into
> > splice, providing a zero-copy of the trace data from the write of the event
> > to going into a file.
> > 
> > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > into files to avoid as much copying during live recordings as possible.
> > 
> > Whatever changes we make, I would like to make sure there's no regressions
> > in performance of trace-cmd record.
> 
> Well yes, the patchset seems sensible from a quality POV.  But to make
> a decision we should first have a decent understanding of its downside
> impact.
> 
> I haven't seen a description of that impact in the discussion thus far.
> And that description is owed, please.
> 
> I assume a small number of specialized applications are using
> vmsplice() to great effect?  What are those applications?  What is the
> impact of this change?

> Once we are armed with that information, is there some middle ground in
> which we de-feature vmsplice()?  Fall back to pread/pwrite in the
> tricky cases and still permit vmsplicing if the application is
> appropriately restrictive in its usage?

I'm using vmsplice() + tee() + splice() in high-performance applications,
load generators to be precise, and soon a cache. This is super convenient
and extremely efficient:

  - vmsplice() is used to prepare a "master" pipe with data to be sent
    over TCP or kTLS
  - then for each request, we do tee() from this master pipe to per-request
    pipes.
  - the per-request pipes are those that are used to deliver the data to
    the socket via splice().

So we effectively use vmsplice(), tee() and splice() here, and for exactly
the reasons they were designed: only play with page refcount and not copy
data. The code is here for the curious:

   https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c

and its ancestor is here:

   https://github.com/wtarreau/httpterm/blob/master/httpterm.c

It simply doubles the network bandwidth compared to not using that.
(62 Gbps per core vs 31). I would seriously miss it if I couldn't use
this anymore.
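
The three steps listed above can be sketched as follows. This is a
simplified illustration, not the haterm code: load_master() and
serve_from_master() are hypothetical names, short transfers are not
retried, and sock_fd can be any descriptor that splice() accepts as a
destination.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Step 1: fill the "master" pipe once from a user buffer. */
static ssize_t load_master(int master_wr, void *data, size_t len)
{
    struct iovec iov = { .iov_base = data, .iov_len = len };

    assert(data != NULL && len > 0);
    return vmsplice(master_wr, &iov, 1, 0);
}

/*
 * Steps 2 and 3: for one request, tee() the master pipe into a fresh
 * per-request pipe (the master keeps its data), then splice() the
 * per-request pipe into the destination.
 */
static ssize_t serve_from_master(int master_rd, int sock_fd, size_t len)
{
    int req[2];
    ssize_t sent;

    if (pipe(req) != 0)
        return -1;

    /* tee() duplicates up to len bytes without consuming them. */
    sent = tee(master_rd, req[1], len, 0);
    if (sent > 0)
        sent = splice(req[0], NULL, sock_fd, NULL, (size_t)sent, 0);

    close(req[0]);
    close(req[1]);
    return sent;
}
```

serve_from_master() can be called repeatedly on the same master pipe,
because tee() leaves the master's contents in place.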

I also have mid-term plans for using vmsplice() to deliver contents from
a cache to sockets as well via splice(). Right now our cache is split into
too small chunks (1kB) to make that useful, but as soon as we can move to
4kB pages, it will make sense. There the same gains are expected, and I
would particularly dislike the idea of no longer being able to implement
zero-copy!

Maybe some arrangements are possible though. I'm not seeing any other way
to achieve the same things differently, but possibly the root of the
problem is the easy abuse of vmsplice() to affect the page cache. Maybe
placing certain restrictions, such as the area only being mapped to anonymous
pages, or anything similar could make sense. In my use case it wouldn't be
that much of a constraint. Well, for the cache maybe it could be though,
as it would prevent us from sharing it via persistent storage. Or maybe
we could require a CAP_BACKED_VMSPLICE to be allowed to vmsplice file-
backed pages, which could be sufficient to prevent easy LPE each time a
bug is found ?

I think that the users of these APIs are rare enough that we can probably
find a solution that anyone can reasonably adapt to with minimal
constraints. But most likely each of these few users relies on this
*a lot*.

Just my two cents,
Willy

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 3 days, 15 hours ago

On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau <w@1wt.eu> wrote:
>
> On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > > On Mon, 1 Jun 2026 18:33:25 +0100
> > > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > >
> > > > > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > > > > a big simplification.
> > > >
> > > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > > Communications between the kernel and fuse server at least used to
> > > > seriously want that, so that would be one place to look for unhappy
> > > > userland...
> > > >
> > > > splice-related logics in fs/fuse/dev.c is interesting; another place
> > > > like this is kernel/trace/, but I'm less familiar with that one.
> > > >
> > > > rostedt Cc'd (miklos already had been)
> > >
> > > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > > by splice and the libtracefs has a lot of code to use it as well. As
> > > reading the ring buffer literally swaps out the write portion with a blank
> > > read portion, that portion (sub-buffer) is used to be directly fed into
> > > splice, providing a zero-copy of the trace data from the write of the event
> > > to going into a file.
> > >
> > > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > > into files to avoid as much copying during live recordings as possible.
> > >
> > > Whatever changes we make, I would like to make sure there's no regressions
> > > in performance of trace-cmd record.
> >
> > Well yes, the patchset seems sensible from a quality POV.  But to make
> > a decision we should first have a decent understanding of its downside
> > impact.
> >
> > I haven't seen a description of that impact in the discussion thus far.
> > And that description is owed, please.
> >
> > I assume a small number of specialized applications are using
> > vmsplice() to great effect?  What are those applications?  What is the
> > impact of this change?
>
> > Once we are armed with that information, is there some middle ground in
> > which we de-feature vmsplice()?  Fall back to pread/pwrite in the
> > tricky cases and still permit vmsplicing if the application is
> > appropriately restrictive in its usage?
>
> I'm using vmsplice() + tee() + splice() in high-performance applications,
> load generators to be precise, and soon a cache. This is super convenient
> and extremely efficient:
>
>   - vmsplice() is used to prepare a "master" pipe with data to be sent
>     over TCP or kTLS
>   - then for each request, we do tee() from this master pipe to per-request
>     pipes.
>   - the per-request pipes are those that are used to deliver the data to
>     the socket via splice().
>
> So we effectively use vmsplice(), tee() and splice() here, and for exactly
> the reasons they were designed: only play with page refcount and not copy
> data. The code is here for the curious:
>
>    https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c
>
> and its ancestor is here:
>
>    https://github.com/wtarreau/httpterm/blob/master/httpterm.c
>
> It simply doubles the network bandwidth compared to not using that.
> (62 Gbps per core vs 31). I would seriously miss it if I couldn't use
> this anymore.
>

Wait a moment.  This is neat, but it's literally just a benchmark,
right?  I skimmed the code, and it doesn't look like a production
workload, either.  And you manage to get around the awfulness of the
vmsplice API's complete failure to tell you when it's done with a
buffer by ... never actually changing the contents of the buffer.  Do
you have any idea how you would write correct code that uses vmsplice
for sends and then *ever* mutates the data without literally
munmapping (or madvise or something) the data so you can safely mutate
it?

> I also have mid-term plans for using vmsplice() to deliver contents from
> a cache to sockets as well via splice(). Right now our cache is split into
> too small chunks (1kB) to make that useful, but as soon as we can move to
> 4kB pages, it will make sense. There the same gains are expected, and I
> would particularly dislike the idea of no longer being able to implement
> zero-copy!

If I'm understanding you correctly, you see (and measured!) a
performance improvement, and you would like to use it in production.

It seems to me that this is an excellent opportunity to remember that
vmsplice gets a performance boost in a highly synthetic situation that
sort of resembles a cache scenario and then to deprecate vmsplice and
build something better!  Or discover that we already have something
better, perhaps :)

https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html

I see that this can submit a buffer without a syscall (tee + splice is
*two* syscalls!) and that it has directly addressed what I see as the
really big deficiency in vmsplice: "This second notification tells the
application that the memory associated with the send is safe to get
reused."  If I were writing the user code, I would very much want that
notification to be an explicit part of the API instead of making a
wild guess as I think I would need to do with vmsplice.

--Andy

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Willy Tarreau 3 days, 15 hours ago

On Thu, Jun 04, 2026 at 08:53:15AM -0700, Andy Lutomirski wrote:
> On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau <w@1wt.eu> wrote:
> >
> > On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> > > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> > >
> > > > On Mon, 1 Jun 2026 18:33:25 +0100
> > > > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > > >
> > > > > > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > > > > > a big simplification.
> > > > >
> > > > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > > > Communications between the kernel and fuse server at least used to
> > > > > seriously want that, so that would be one place to look for unhappy
> > > > > userland...
> > > > >
> > > > > splice-related logics in fs/fuse/dev.c is interesting; another place
> > > > > like this is kernel/trace/, but I'm less familiar with that one.
> > > > >
> > > > > rostedt Cc'd (miklos already had been)
> > > >
> > > > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > > > by splice and the libtracefs has a lot of code to use it as well. As
> > > > reading the ring buffer literally swaps out the write portion with a blank
> > > > read portion, that portion (sub-buffer) is used to be directly fed into
> > > > splice, providing a zero-copy of the trace data from the write of the event
> > > > to going into a file.
> > > >
> > > > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > > > into files to avoid as much copying during live recordings as possible.
> > > >
> > > > Whatever changes we make, I would like to make sure there's no regressions
> > > > in performance of trace-cmd record.
> > >
> > > Well yes, the patchset seems sensible from a quality POV.  But to make
> > > a decision we should first have a decent understanding of its downside
> > > impact.
> > >
> > > I haven't seen a description of that impact in the discussion thus far.
> > > And that description is owed, please.
> > >
> > > I assume a small number of specialized applications are using
> > > vmsplice() to great effect?  What are those applications?  What is the
> > > impact of this change?
> >
> > > Once we are armed with that information, is there some middle ground in
> > > which we de-feature vmsplice()?  Fall back to pread/pwrite in the
> > > tricky cases and still permit vmsplicing if the application is
> > > appropriately restrictive in its usage?
> >
> > I'm using vmsplice() + tee() + splice() in high-performance applications,
> > load generators to be precise, and soon a cache. This is super convenient
> > and extremely efficient:
> >
> >   - vmsplice() is used to prepare a "master" pipe with data to be sent
> >     over TCP or kTLS
> >   - then for each request, we do tee() from this master pipe to per-request
> >     pipes.
> >   - the per-request pipes are those that are used to deliver the data to
> >     the socket via splice().
> >
> > So we effectively use vmsplice(), tee() and splice() here, and for exactly
> > the reasons they were designed: only play with page refcount and not copy
> > data. The code is here for the curious:
> >
> >    https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c
> >
> > and its ancestor is here:
> >
> >    https://github.com/wtarreau/httpterm/blob/master/httpterm.c
> >
> > It simply doubles the network bandwidth compared to not using that.
> > (62 Gbps per core vs 31). I would seriously miss it if I couldn't use
> > this anymore.
> >
> 
> Wait a moment.  This is neat, but it's literally just a benchmark,
> right?

No, it's a benchmark *tool*: it's being used to stress production code,
which is important and super hard at high loads. You place it after your
proxy and you measure the performance of the proxy (which is supposed not
to be as capable as the testing tools, otherwise the methodology devolves
into testing the testing tools, which is not the point).

> I skimmed the code, and it doesn't look like a production
> workload, either.  And you manage to get around the awfulness of the
> vmsplice API's complete failure to tell you when it's done with a
> buffer by ... never actually changing the contents of the buffer.  Do
> you have any idea how you would write correct code that uses vmsplice
> for sends and then *ever* mutates the data without literally
> munmapping (or madvise or something) the data so you can safely mutate
> it?

I'm not sure what you mean here Andy. I *do not* need to change the
data, it's just a pre-made pattern.

> > I also have mid-term plans for using vmsplice() to deliver contents from
> > a cache to sockets as well via splice(). Right now our cache is split into
> > too small chunks (1kB) to make that useful, but as soon as we can move to
> > 4kB pages, it will make sense. There the same gains are expected, and I
> > would particularly dislike the idea of no longer being able to implement
> > zero-copy!
> 
> If I'm understanding you correctly, you see (and measured!) a
> performance improvement, and you would like to use it in production.

The prod for the tool is to be used to benchmark other tools. It does
the job quite well. It's even more important when you use kTLS-enabled
hardware where you can get zero-copy all along the line and delegate
the crypto to the hardware. That's the beauty of all the nice work that
was done in the stack along all these years. That code started to be
used in clear maybe 15 years ago or so, but nowadays the gains are even
more interesting.

> It seems to me that this is an excellent opportunity to remember that
> vmsplice gets a performance boost in a highly synthetic situation that
> sort of resembles a cache scenario and then to deprecate vmsplice and
> build something better!

I've definitely been keeping vmsplice() on my radar for our cache,
and we've progressively implemented various architectural updates in
haproxy precisely for this.

> Or discover that we already have something better, perhaps :)
> 
> https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html

io_uring is different. We tried it "the dirty way" in the past, by
emulating a poller, and it's not worth it this way. And in order to
do it the right way, it needs to be done totally differently, which
has impacts all over the stack. The code in the file pointed to above
is just for the httpterm testing feature, but the rest is much more
complex.

> I see that this can submit a buffer without a syscall (tee + splice is
> *two* syscalls!) and that it has directly addressed what I see as the
> really big deficiency in vmsplice: "This second notification tells the
> application that the memory associated with the send is safe to get
> reused."  If I were writing the user code, I would very much want that
> notification to be an explicit part of the API instead of making a
> wild guess as I think I would need to do with vmsplice.

I agree, for the cache it's something important (not for the load
generator). But IIRC that's something you can also check via SIOCOUTQ
which is normally sufficient for a cache's eviction system (though not
fantastic).

Willy

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 3 days, 14 hours ago

On Thu, Jun 4, 2026 at 9:09 AM Willy Tarreau <w@1wt.eu> wrote:
>
> On Thu, Jun 04, 2026 at 08:53:15AM -0700, Andy Lutomirski wrote:
> > On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> > > > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> > > >
> > > > > On Mon, 1 Jun 2026 18:33:25 +0100
> > > > > Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > >
> > > > > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > > > >
> > > > > > > > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > > > > > > a big simplification.
> > > > > >
> > > > > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > > > > Communications between the kernel and fuse server at least used to
> > > > > > seriously want that, so that would be one place to look for unhappy
> > > > > > userland...
> > > > > >
> > > > > > splice-related logics in fs/fuse/dev.c is interesting; another place
> > > > > > like this is kernel/trace/, but I'm less familiar with that one.
> > > > > >
> > > > > > rostedt Cc'd (miklos already had been)
> > > > >
> > > > > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > > > > by splice and the libtracefs has a lot of code to use it as well. As
> > > > > reading the ring buffer literally swaps out the write portion with a blank
> > > > > read portion, that portion (sub-buffer) is used to be directly fed into
> > > > > splice, providing a zero-copy of the trace data from the write of the event
> > > > > to going into a file.
> > > > >
> > > > > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > > > > into files to avoid as much copying during live recordings as possible.
> > > > >
> > > > > Whatever changes we make, I would like to make sure there's no regressions
> > > > > in performance of trace-cmd record.
> > > >
> > > > Well yes, The patchset seems sensible from a quality POV.  But to make
> > > > a decision we should first have a decent understanding of its downside
> > > > impact.
> > > >
> > > > I haven't seen a description of that impact in the discussion thus far.
> > > > And that description is owed, please.
> > > >
> > > > I assume a small number of specialized applications are using
> > > > vmsplice() to great effect?  What are those applications?  What is the
> > > > impact of this change?
> > >
> > > > Once we are armed with that information, is there some middle ground in
> > > > which we de-feature vmsplice()?  Fall back to pread/pwrite in the
> > > > tricky cases and still permit vmsplicing if the application is
> > > > > appropriately restrictive in its usage?
> > >
> > > I'm using vmsplice() + tee() + splice() in high-performance applications,
> > > load generators to be precise, and soon a cache. This is super convenient
> > > and extremely efficient:
> > >
> > >   - vmsplice() is used to prepare a "master" pipe with data to be sent
> > >     over TCP or kTLS
> > >   - then for each request, we do tee() from this master pipe to per-request
> > >     pipes.
> > >   - the per-request pipes are those that are used to deliver the data to
> > >     the socket via splice().
> > >
> > > So we effectively use vmsplice(), tee() and splice() here, and for exactly
> > > the reasons they were designed: only play with page refcount and not copy
> > > data. The code is here for the curious:
> > >
> > >    https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c
> > >
> > > and its ancestor is here:
> > >
> > >    https://github.com/wtarreau/httpterm/blob/master/httpterm.c
> > >
> > > It simply doubles the network bandwidth compared to not using that.
> > > (62 Gbps per core vs 31). I would seriously miss it if I couldn't use
> > > this anymore.
> > >
> >
> > Wait a moment.  This is neat, but it's literally just a benchmark,
> > right?
>
> No, it's a benchmark *tool*: it's being used to stress production code,
> which is important and super hard at high loads. You place it after your
> proxy and you measure the performance of the proxy (which is supposed not
> to be as capable as the testing tools otherwise the methodology revolves
> to testing the testing tools, which is not the point).
>
> > I skimmed the code, and it doesn't look like a production
> > workload, either.  And you manage to get around the awfulness of the
> > vmsplice API's complete failure to tell you when it's done with a
> > buffer by ... never actually changing the contents of the buffer.  Do
> > you have any idea how you would write correct code that uses vmsplice
> > for sends and then *ever* mutates the data without literally
> > munmapping (or madvise or something) the data so you can safely mutate
> > it?
>
> I'm not sure what you mean here Andy. I *do not* need to change the
> data, it's just a pre-made pattern.

What I mean is: this particular pattern seems limited for use in an
actual webserver as opposed to a load-tester.

> > Or discover that we already have something better, perhaps :)
> >
> > https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html
>
> io_uring is different. We tried it "the dirty way" in the past, by
> emulating a poller, and it's not worth it this way. And in order to
> do it the right way, it needs to be done totally differently, which
> has impacts all over the stack. The code in the file pointed to above
> is just for the httpterm testing feature, but the rest is much more
> complex.

I'm curious how this kludge does:

https://github.com/amluto/zc_bench

I vibe-coded this up without much care, and I don't have the hardware
needed to actually run it in an interesting manner.  But, on a Linux
VM on an Apple M4, I can push about 130Gbps on a single core over
loopback.  In theory this will do zerocopy sends (but not over
loopback), and I would hope that it runs *faster* than vmsplice + tee.

(I have a fancy workstation that can do a whopping 2.5Gbps.  I could
probably jury-rig a test over Thunderbolt at higher speeds.  I have
systems that are not available for this test right now that can do
10Gbps.  But someone probably needs 40Gbps or better hardware for a
genuinely interesting test.)

-- 
Andy Lutomirski
AMA Capital Management, LLC

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 3 days, 17 hours ago

On Wed, 3 Jun 2026 at 23:32, Willy Tarreau <w@1wt.eu> wrote:
>
> I'm using vmsplice() + tee() + splice() in high-performance applications,
> load generators to be precise, and soon a cache. This is super convenient
> and extremely efficient:
>
>   - vmsplice() is used to prepare a "master" pipe with data to be sent
>     over TCP or kTLS
>   - then for each request, we do tee() from this master pipe to per-request
>     pipes.
>   - the per-request pipes are those that are used to deliver the data to
>     the socket via splice().

So most of those would actually not be affected by any of the existing
patches: the pipe->socket splice would remain, the tee() code would
still just take a ref to the page count.
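
For readers unfamiliar with the three-step pattern quoted above, it can be sketched roughly as follows. This is not the haterm.c code: the helper names `fill_master` and `serve_request` are hypothetical, error handling is minimal, and the destination is any descriptor that splice() can write to (a socket in the use case described).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for vmsplice(), tee(), splice() */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: queue `len` bytes of `buf` into the write end of
 * the "master" pipe. The caller must never modify `buf` afterwards.
 */
ssize_t fill_master(int master_wr, const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    return vmsplice(master_wr, &iov, 1, 0);
}

/*
 * Hypothetical helper: serve one request. tee() duplicates up to `len`
 * bytes from the master pipe into the per-request pipe without consuming
 * them, then splice() moves them from that pipe to `out_fd`.
 * Returns the number of bytes delivered, or -1 on error.
 */
ssize_t serve_request(int master_rd, int req_pipe[2], int out_fd, size_t len)
{
    ssize_t teed = tee(master_rd, req_pipe[1], len, SPLICE_F_NONBLOCK);
    ssize_t sent = 0;

    if (teed <= 0)
        return teed;

    while (sent < teed) {
        ssize_t n = splice(req_pipe[0], NULL, out_fd, NULL,
                           (size_t)(teed - sent), SPLICE_F_MOVE);
        if (n < 0)
            return -1;
        if (n == 0)
            break;
        sent += n;
    }
    return sent;
}
```

Because tee() does not consume the master pipe, `serve_request()` can be called repeatedly against the same pre-filled pattern; only the `fill_master()` step goes through vmsplice().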

The vmsplice() would change, but looking at your haterm.c sources, it
looks like it's mostly a fairly small thing ("common_response[]" being
16kB).

That is typically *faster* to just copy than look up pages.

HOWEVER.

It looks like you're actually doing exactly the thing that I thought
was crazy and wouldn't even work reliably: you change the
common_response[] contents dynamically *after* the vmsplice, and
depend on the fact that changing it in user space changes the buffer
in the pipe too.

So that would break *entirely* with the vmsplice() changes if I read
the code right (which I might not do) simply because that looks like
it really does require that "writably shared buffer after the fact".

Interesting.  Because the vmsplice() code uses get_user_pages_fast(),
and honestly, it never pinned the page reliably to the original source
- it breaks COW randomly in one direction or the other after fork()
(and I thought even after a page-out, but thinking more about it the
swap cache may have made it work for that case).

Uhhuh. That does look like it makes the vmsplice() changes untenable.

But I may be reading your haproxy code entirely wrong.

               Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Willy Tarreau 3 days, 15 hours ago

On Thu, Jun 04, 2026 at 07:31:30AM -0700, Linus Torvalds wrote:
> On Wed, 3 Jun 2026 at 23:32, Willy Tarreau <w@1wt.eu> wrote:
> >
> > I'm using vmsplice() + tee() + splice() in high-performance applications,
> > load generators to be precise, and soon a cache. This is super convenient
> > and extremely efficient:
> >
> >   - vmsplice() is used to prepare a "master" pipe with data to be sent
> >     over TCP or kTLS
> >   - then for each request, we do tee() from this master pipe to per-request
> >     pipes.
> >   - the per-request pipes are those that are used to deliver the data to
> >     the socket via splice().
> 
> So most of those would actually not be affected by any of the existing
> patches: the pipe->socket splice would remain, the tee() code would
> still just take a ref to the page count.

OK!

> The vmsplice() would change,

OK but for this use case it's not dramatic (it could be more annoying
for the cache where I'd like this zero-copy from memory to the wire
though).

> but looking at your haterm.c sources, it
> looks like it's mostly a fairly small thing ("common_response[]" being
> 16kB).

In this one it's indeed a 16kB block that is repeated into the
same pipe by simplicity, in its ancestor it was 64kB. We try to
make as large a pipe as we can, but that's all.

> That is typically *faster* to just copy than look up pages.
> 
> HOWEVER.
> 
> It looks like you're actually doing exactly the thing that I thought
> was crazy and wouldn't even work reliably: you change the
> common_response[] contents dynamically *after* the vmsplice, and
> depend on the fact that changing it in user space changes the buffer
> in the pipe too.

No no, it's definitely not doing that (or it's a bug, but it's not
supposed to happen). I'm perfectly aware that one must definitely not
do that, and it's a guarantee the user of vmsplice() must provide.

> So that would break *entirely* with the vmsplice() changes if I read
> the code right (which I might not do) simply because that looks like
> it really does require that "writably shared buffer after the fact".

We agree that this would deliver complete garbage and I'm not interested
in such a "feature" at all.

> Interesting.  Because the vmsplice() code uses get_user_pages_fast(),
> and honestly, it never pinned the page reliably to the original source
> - it breaks COW randomly in one direction or the other after fork()

I must confess I never knew how it deals with pages shared over a
fork(), and have been wondering if two processes could create a
shared memory area on the fly just by using vmsplice() on each side
and end up with the same pages (I don't need this but it could have
very nice use cases).

> (and I thought even after a page-out, but thinking more about it the
> swap cache may have made it work for that case).
> 
> Uhhuh. That does look like it makes the vmsplice() changes untenable.

No no don't worry, I'm not seeing any value in changing data after
vmsplice() and that would just be a bug. My goal here is only to
pre-fill a buffer with a pattern then prepare the pipe with that
pattern, nothing less, nothing more.

> But I may be reading your haproxy code entirely wrong.

I think so, but I wouldn't be the one blaming you for this ;-)

Thanks for the clarifications!
Willy

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 3 days, 15 hours ago

On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
>
> > It looks like you're actually doing exactly the thing that I thought
> > was crazy and wouldn't even work reliably: you change the
> > common_response[] contents dynamically *after* the vmsplice, and
> > depend on the fact that changing it in user space changes the buffer
> > in the pipe too.
>
> No no, it's definitely not doing that (or it's a bug, but it's not
> supposed to happen). I'm perfectly aware that one must definitely not
> do that, and it's a guarantee the user of vmsplice() must provide.

Whew, good.

In that case, can you just try the vmsplice patch series (Christian
already found a bug, but I don't think it will necessarily matter in
practice - famous last words) and that test patch of mine, and see if
it all (a) works for you and (b) if you have any numbers for
performance that would be *great*.

There aren't many obvious splice users out there, and even if they
were to exist they are typically specialized enough that you have to
have a real use case to then tell if the patches make a difference in
real life or not.

So you testing that thing would seem to be a great first test of
whether any of this is realistic.

               Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by The 8472 2 days, 10 hours ago

On 04/06/2026 17:58, Linus Torvalds wrote:
> On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
>>
>>> It looks like you're actually doing exactly the thing that I thought
>>> was crazy and wouldn't even work reliably: you change the
>>> common_response[] contents dynamically *after* the vmsplice, and
>>> depend on the fact that changing it in user space changes the buffer
>>> in the pipe too.
>>
>> No no, it's definitely not doing that (or it's a bug, but it's not
>> supposed to happen). I'm perfectly aware that one must definitely not
>> do that, and it's a guarantee the user of vmsplice() must provide.
> 
> Whew, good.
> 
> In that case, can you just try the vmsplice patch series (Christian
> already found a bug, but I don't think it will necessarily matter in
> practice - famous last words) and that test patch of mine, and see if
> it all (a) works for you and (b) if you have any numbers for
> performance that would be *great*.
> 
> There aren't many obvious splice users out there, and even if they
> were to exist they are typically specialized enough that you have to
> have a real use case to then tell if the patches make a difference in
> real life or not.

In the Rust standard library we use splice as one of several strategies
in our generic io::copy[0] routine. It selects the strategy[1] based on
source and sink types.

It tries

- copy_file_range
- sendfile
- splice
- fallback to userspace read-write loop

sendfile or splice are skipped when we can't uphold the "callers must ensure
transferred portions in_fd remain unmodified" condition on the manpage,
which unfortunately includes some particularly desirable combinations of
sinks and sources (such as mutable files -> socket).

We primarily want this for reflink copies and to avoid the syscall
overhead of a read-write loop with a small stack buffer.
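
The strategy chain described above can be illustrated in C (the actual implementation is the Rust code linked in [1]). This is a simplified sketch under stated assumptions: the function name `copy_with_fallback` and the set of errno values treated as "unsupported, try the next strategy" are illustrative choices, the splice step is left out for brevity, and the check that skips sendfile when the source may be modified is not modeled.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for copy_file_range() */
#endif
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Errors taken to mean "this strategy does not apply to these fds". */
static int strategy_unsupported(int err)
{
    return err == ENOSYS || err == EXDEV || err == EINVAL ||
           err == EOPNOTSUPP || err == EPERM;
}

/*
 * Copy up to `len` bytes from in_fd to out_fd at their current offsets.
 * Tries copy_file_range(), then sendfile(), then a read/write loop.
 * Returns the number of bytes copied (short on EOF), or -1 on error.
 */
ssize_t copy_with_fallback(int in_fd, int out_fd, size_t len)
{
    enum { CFR, SENDFILE, RW_LOOP } strategy = CFR;
    size_t done = 0;

    while (done < len) {
        size_t want = len - done;
        ssize_t n;

        if (strategy == CFR) {
            n = copy_file_range(in_fd, NULL, out_fd, NULL, want, 0);
        } else if (strategy == SENDFILE) {
            n = sendfile(out_fd, in_fd, NULL, want);
        } else {
            char buf[8192];
            ssize_t off = 0;

            n = read(in_fd, buf, want < sizeof buf ? want : sizeof buf);
            while (n > 0 && off < n) {
                ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
                if (w < 0) {
                    if (errno == EINTR)
                        continue;
                    return -1;
                }
                off += w;
            }
        }

        if (n < 0) {
            if (errno == EINTR)
                continue;
            if (strategy != RW_LOOP && strategy_unsupported(errno)) {
                strategy++;        /* fall through to the next strategy */
                continue;
            }
            return -1;
        }
        if (n == 0)
            break;                 /* EOF on the source */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The in-kernel strategies keep the data out of userspace entirely, while the last branch is the plain read-write loop whose syscall overhead the earlier strategies are meant to avoid.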

Any additional zerocopy benefit, when it doesn't lead to unstable data, is
welcome but not critical. E.g. it'd be nice if sendfile could do the following:
For a 1MB source and a socket with a 64kB sendbuffer it could zerocopy first ~900kB
safely and then memcpy the last 64kB to ensure it can't be modified after the
syscall returns. But a "just memcpy in kernel space instead of zerocopy" flag for
sendfile would be ok too.

We're currently not making use of vmsplice. In theory we'd like to use it for
copying from `&'static [u8]` sources since the type upholds the requirements of
vmsplice, but type specialization currently is not powerful enough to
select based on this lifetime and it's unclear if it'll ever be.

[0] https://doc.rust-lang.org/nightly/std/io/fn.copy.html
[1] https://github.com/rust-lang/rust/blob/ac6f3a3e778a586854bdbf8f15202e11e2348d9f/library/std/src/sys/io/kernel_copy/linux.rs#L210-L259

> 
> So you testing that thing would seem to be a great first test of
> whether any of this is realistic..
> 
>                 Linus
>

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Willy Tarreau 3 days, 15 hours ago

On Thu, Jun 04, 2026 at 08:58:33AM -0700, Linus Torvalds wrote:
> On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
> >
> > > It looks like you're actually doing exactly the thing that I thought
> > > was crazy and wouldn't even work reliably: you change the
> > > common_response[] contents dynamically *after* the vmsplice, and
> > > depend on the fact that changing it in user space changes the buffer
> > > in the pipe too.
> >
> > No no, it's definitely not doing that (or it's a bug, but it's not
> > supposed to happen). I'm perfectly aware that one must definitely not
> > do that, and it's a guarantee the user of vmsplice() must provide.
> 
> Whew, good.
> 
> In that case, can you just try the vmsplice patch series (Christian
> already found a bug, but I don't think it will necessarily matter in
> practice - famous last words) and that test patch of mine, and see if
> it all (a) works for you and (b) if you have any numbers for
> performance that would be *great*.

Yes I wanted to do that and noted it on my todo list yesterday when
noticing the ongoing discussion. Just been super busy with yesterday's
bi-yearly release ;-) But at least I wanted to share quick feedback in
this thread about existing uses.

> There aren't many obvious splice users out there, and even if they
> were to exist they are typically specialized enough that you have to
> have a real use case to then tell if the patches make a difference in
> real life or not.

I totally agree, that's why I want to share some feedback. I remember
years ago when splice() was broken in 2.6.25, there were so few users
that I was the one reporting an API issue to Eric who addressed it
early by lack of users. And I even consider that due to the very few
users, it's even acceptable to slightly change the way to use it if
it can provide extra guarantees (like requiring a capability to access
non-anonymous pages for example). It should not break that many apps,
and as long as they can preserve their essential benefits, I think
most will be OK to adapt.

> So you testing that thing would seem to be a great first test of
> whether any of this is realistic..

Absolutely! 

Willy

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Willy Tarreau 2 days, 15 hours ago

On Thu, Jun 04, 2026 at 06:15:41PM +0200, Willy Tarreau wrote:
> On Thu, Jun 04, 2026 at 08:58:33AM -0700, Linus Torvalds wrote:
> > On Thu, 4 Jun 2026 at 08:53, Willy Tarreau <w@1wt.eu> wrote:
> > >
> > > > It looks like you're actually doing exactly the thing that I thought
> > > > was crazy and wouldn't even work reliably: you change the
> > > > common_response[] contents dynamically *after* the vmsplice, and
> > > > depend on the fact that changing it in user space changes the buffer
> > > > in the pipe too.
> > >
> > > No no, it's definitely not doing that (or it's a bug, but it's not
> > > supposed to happen). I'm perfectly aware that one must definitely not
> > > do that, and it's a guarantee the user of vmsplice() must provide.
> > 
> > Whew, good.
> > 
> > In that case, can you just try the vmsplice patch series (Christian
> > already found a bug, but I don't think it will necessarily matter in
> > practice - famous last words) and that test patch of mine, and see if
> > it all (a) works for you and (b) if you have any numbers for
> > performance that would be *great*.
> 
> Yes I wanted to do that and noted it on my todo list yesterday when
> noticing the ongoing discussion. Just been super busy with yesterday's
> bi-yearly release ;-) But at least I wanted to share quick feedback in
> this thread about existing uses.

OK so I could run the test this afternoon, with:
  - ddd664bbff63 Merge tag 'net-7.1-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
    (v7.1-rc6-178)

  - the same with Christian's vfs-7.2.vmsplice branch merged into it
    ( 8d86fcfc2857 include/linux/splice.h: trivial fix: declerations -> declarations)

Both show 71-72 Gbps of TLS traffic per core on my test utility (I
stopped at 3 cores since having only 2x100G at the moment), so for
this use case I'm not impacted by the change. I noted that I will
have to reconsider other options for the cache (send(MSG_ZEROCOPY)
probably) but in my case since the code doesn't exist yet it's not
per-se a userland breakage, but a change of plans. I just hope I'll
find my way through the alternate solution.

FWIW for Christian's branch:

Tested-by: Willy Tarreau <w@1wt.eu>

Willy

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by David Hildenbrand (Arm) 5 days, 23 hours ago

On 6/2/26 02:28, Andrew Morton wrote:
> On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> 
>> On Mon, 1 Jun 2026 18:33:25 +0100
>> Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>>>
>>>
>>> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
>>> Communications between the kernel and fuse server at least used to
>>> seriously want that, so that would be one place to look for unhappy
>>> userland...
>>>
>>> splice-related logics in fs/fuse/dev.c is interesting; another place
>>> like this is kernel/trace/, but I'm less familiar with that one.
>>>
>>> rostedt Cc'd (miklos already had been)
>>
>> Thanks for the Cc. The tracing ring buffer was specifically made to be used
>> by splice and the libtracefs has a lot of code to use it as well. As
>> reading the ring buffer literally swaps out the write portion with a blank
>> read portion, that portion (sub-buffer) is used to be directly fed into
>> splice, providing a zero-copy of the trace data from the write of the event
>> to going into a file.
>>
>> trace-cmd defaults to using splice to copy the tracing ring buffer directly
>> into files to avoid as much copying during live recordings as possible.
>>
>> Whatever changes we make, I would like to make sure there's no regressions
>> in performance of trace-cmd record.
> 
> Well yes, The patchset seems sensible from a quality POV.  But to make
> a decision we should first have a decent understanding of its downside
> impact.

I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
entirely is certainly very appealing ...

> 
> I haven't seen a description of that impact in the discussion thus far.
> And that description is owed, please.
> 
> I assume a small number of specialized applications are using
> vmsplice() to great effect?  What are those applications?  What is the
> impact of this change?

I did some digging, and the kernel crypto API documents using splice/vmsplice
for zero-copy[1] and libkcapi [2].

I did not find performance numbers, how much vmsplice/splice actually gives us.
Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
doesn't really reveal a big difference at least on my notebook. Not sure if the
parameters I specify are reasonable.

I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
significantly worse than sendmsg ... and I don't know what the default would
usually be (default to vmsplice or sendmsg). I might try finding some time to
play with it more, but I doubt it, so if anybody else has time ... :)

I'll note that we have a bunch of selftests (mostly around COW handling) that
rely on vmsplice to test R/O pinning behavior. For R/W pinning, we can use
iouring fixed buffers easily. The only alternative for R/O pinning is using the
gup_test infrastructure that needs to be compiled into the kernel, unfortunately ...

So we'll have to adjust some tests there to use a different interface. I'm sure
I can find someone to work on that once this change has landed and doesn't have
to be yanked immediately again.

[1] https://www.kernel.org/doc/html/latest/crypto/userspace-if.html
[2] https://github.com/smuellerDD/libkcapi/blob/master/lib/kcapi-kernel-if.c
[3] https://github.com/smuellerDD/libkcapi/tree/master/speed-test

-- 
Cheers,

David

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Eric Biggers 5 days, 12 hours ago

On Tue, Jun 02, 2026 at 10:25:06AM +0200, David Hildenbrand (Arm) wrote:
> On 6/2/26 02:28, Andrew Morton wrote:
> > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> >> On Mon, 1 Jun 2026 18:33:25 +0100
> >> Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>
> >>>
> >>>
> >>> FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> >>> Communications between the kernel and fuse server at least used to
> >>> seriously want that, so that would be one place to look for unhappy
> >>> userland...
> >>>
> >>> splice-related logics in fs/fuse/dev.c is interesting; another place
> >>> like this is kernel/trace/, but I'm less familiar with that one.
> >>>
> >>> rostedt Cc'd (miklos already had been)
> >>
> >> Thanks for the Cc. The tracing ring buffer was specifically made to be used
> >> by splice and the libtracefs has a lot of code to use it as well. As
> >> reading the ring buffer literally swaps out the write portion with a blank
> >> read portion, that portion (sub-buffer) is used to be directly fed into
> >> splice, providing a zero-copy of the trace data from the write of the event
> >> to going into a file.
> >>
> >> trace-cmd defaults to using splice to copy the tracing ring buffer directly
> >> into files to avoid as much copying during live recordings as possible.
> >>
> >> Whatever changes we make, I would like to make sure there's no regressions
> >> in performance of trace-cmd record.
> > 
> > Well yes, The patchset seems sensible from a quality POV.  But to make
> > a decision we should first have a decent understanding of its downside
> > impact.
> 
> I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
> entirely is certainly very appealing ...
> 
> > 
> > I haven't seen a description of that impact in the discussion thus far.
> > And that description is owed, please.
> > 
> > I assume a small number of specialized applications are using
> > vmsplice() to great effect?  What are those applications?  What is the
> > impact of this change?
> 
> 
> I did some digging, and the kernel crypto API documents using splice/vmsplice
> for zero-copy[1] and libkcapi [2].
> 
> I did not find performance numbers, how much vmsplice/splice actually gives us.
> Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
> doesn't really reveal a big difference at least on my notebook. Not sure if the
> parameters I specify are reasonable.
> 
> I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
> significantly worse than sendmsg ... and I don't know what the default would
> usually be (default to vmsplice or sendmsg). I might try finding some time to
> play with it more, but I doubt it, so if anybody else has time ... :)

AF_ALG is a mistake and isn't commonly used.  Using a userspace crypto
library is faster and is what almost everyone does anyway, as it avoids
the syscall overhead.  There are many other issues with AF_ALG as well.

7.2 will mark AF_ALG as deprecated, mostly remove AF_ALG's zero-copy
support, and remove AF_ALG's async I/O support:

    https://lore.kernel.org/linux-crypto/20260430011544.31823-1-ebiggers@kernel.org/
    https://lore.kernel.org/linux-crypto/20260504225328.25356-1-ebiggers@kernel.org/
    https://lore.kernel.org/linux-crypto/20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com/

In practice, the programs that are keeping Linux distros from disabling
AF_ALG in their kconfig outright are just iwd, cryptsetup, and bluez.
They use AF_ALG just because it was mistakenly thought to be easier than
using a userspace crypto library.  They don't need maximum performance,
nor do they use vmsplice, splice, or sendfile.

There is other highly niche code out there that does implement the
AF_ALG + vmsplice + splice thing, e.g. libkcapi.  But it's just not
enough of a reason to keep zero-copy support, especially considering
that AF_ALG has always been the wrong solution in the first place.  The
fallback to copying the data is fine for this deprecated API.

- Eric

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by David Hildenbrand (Arm) 4 days, 23 hours ago

On 6/2/26 20:44, Eric Biggers wrote:
> On Tue, Jun 02, 2026 at 10:25:06AM +0200, David Hildenbrand (Arm) wrote:
>> On 6/2/26 02:28, Andrew Morton wrote:
>>>
>>>
>>> Well yes, The patchset seems sensible from a quality POV.  But to make
>>> a decision we should first have a decent understanding of its downside
>>> impact.
>>
>> I guess most (all?) of us ... dislike ... vmsplice(), so trying to remove it
>> entirely is certainly very appealing ...
>>
>>>
>>> I haven't seen a description of that impact in the discussion thus far.
>>> And that description is owed, please.
>>>
>>> I assume a small number of specialized applications are using
>>> vmsplice() to great effect?  What are those applications?  What is the
>>> impact of this change?
>>
>>
>> I did some digging, and the kernel crypto API documents using splice/vmsplice
>> for zero-copy[1] and libkcapi [2].
>>
>> I did not find performance numbers, how much vmsplice/splice actually gives us.
>> Playing with the kcapi-speed tool [3] (specifying --vmsplice vs. --sendmsg)
>> doesn't really reveal a big difference at least on my notebook. Not sure if the
>> parameters I specify are reasonable.
>>
>> I don't know whether downgrading vmsplice to preadv2/pwritev2 would perform
>> significantly worse than sendmsg ... and I don't know what the default would
>> usually be (default to vmsplice or sendmsg). I might try finding some time to
>> play with it more, but I doubt it, so if anybody else has time ... :)
> 
> AF_ALG is a mistake and isn't commonly used.  Using a userspace crypto
> library is faster and is what almost everyone does anyway, as it avoids
> the syscall overhead.  There are many other issues with AF_ALG as well.
> 
> 7.2 will mark AF_ALG as deprecated, mostly remove AF_ALG's zero-copy
> support, and remove AF_ALG's async I/O support:
> 
>     https://lore.kernel.org/linux-crypto/20260430011544.31823-1-ebiggers@kernel.org/
>     https://lore.kernel.org/linux-crypto/20260504225328.25356-1-ebiggers@kernel.org/
>     https://lore.kernel.org/linux-crypto/20260523-af-alg-harden-v1-0-c76755c3a5c5@gmail.com/
> 
> In practice, the programs that are keeping Linux distros from disabling
> AF_ALG in their kconfig outright are just iwd, cryptsetup, and bluez.
> They use AF_ALG just because it was mistakenly thought to be easier than
> using a userspace crypto library.  They don't need maximum performance,
> nor do they use vmsplice, splice, or sendfile.
> 
> There is other highly niche code out there that does implement the
> AF_ALG + vmsplice + splice thing, e.g. libkcapi.  But it's just not
> enough of a reason to keep zero-copy support, especially considering
> that AF_ALG has always been the wrong solution in the first place.  The
> fallback to copying the data is fine for this deprecated API.

Cool, thanks for sharing that Eric!

-- 
Cheers,

David

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Andy Lutomirski 1 week ago

On Sat, May 30, 2026 at 6:03 PM Askar Safin <safinaskar@gmail.com> wrote:
>
> See recent discussion here:
> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
>
> For all these reasons I propose to make vmsplice a simple wrapper for
> preadv2/pwritev2.
>

I have no comment on the code or the history.  But I'm 100% in favor
of the solution.  vmsplice is a crappy API, and would be incredibly
complex to get the implementation right,  and it should be removed.
But it has users, and the approach of just mapping them straight to
pread/pwrite makes perfect sense.

(If anyone wants to contemplate how bad the API is, contemplate gift
mode.  Or contemplate that, if you want correct results, you need to
avoid modifying the memory until the recipient is done reading or you
need to avoid reading the memory until the writer is done writing, and
vmsplice *does not tell you when it's done*.  And there isn't even a
caller specification of whether they want to read or write.  It's ...
crap.)

--Andy
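
The buffer-ownership problem described above goes away once the data is
copied at call time, which is what a plain write gives you. Below is a
minimal userspace sketch of that copy semantic, using ordinary pipe() and
writev() rather than vmsplice(). The function name
copy_survives_modification() is made up for this illustration and is not
part of the patchset.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a buffer into a pipe with writev(),
 * overwrite the buffer, then read the pipe back.
 * Returns 1 if the reader still sees the original bytes,
 * 0 if it does not, and -1 on a system call failure.
 */
int copy_survives_modification(void)
{
    int p[2];
    char buf[] = "original";
    char out[sizeof(buf)] = {0};
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    int result = -1;

    if (pipe(p) != 0)
        return -1;

    /* writev() copies the bytes into the pipe before it returns. */
    if (writev(p[1], &iov, 1) != (ssize_t)sizeof(buf))
        goto out;

    /* The sender may reuse its buffer as soon as writev() is back. */
    memcpy(buf, "clobber!", sizeof(buf));

    if (read(p[0], out, sizeof(out)) != (ssize_t)sizeof(out))
        goto out;

    result = (strcmp(out, "original") == 0);
out:
    close(p[0]);
    close(p[1]);
    return result;
}
```

With copy semantics the return of the call is itself the "done" signal, so
the caller does not have to guess when the memory may be touched again.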

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Matthew Wilcox 6 days, 15 hours ago

On Sun, May 31, 2026 at 08:11:34PM -0700, Andy Lutomirski wrote:
> On Sat, May 30, 2026 at 6:03 PM Askar Safin <safinaskar@gmail.com> wrote:
> >
> > See recent discussion here:
> > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> >
> > For all these reasons I propose to make vmsplice a simple wrapper for
> > preadv2/pwritev2.
> >
> 
> I have no comment on the code or the history.  But I'm 100% in favor
> of the solution.  vmsplice is a crappy API, and would be incredibly
> complex to get the implementation right,  and it should be removed.
> But it has users, and the approach of just mapping them straight to
> pread/pwrite makes perfect sense.

I agree with Andy.  I think it was appropriate to send this series, since
(as far as I can tell) it's a completely different approach from the others
taken.  I'm not really qualified to judge whether the implementation is
good (it's a bit outside my competency as a reviewer), but the described
approach is more convincing to me than the other approaches.

Can we review this series properly?

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 6 days, 15 hours ago

On Mon, 1 Jun 2026 at 08:36, Matthew Wilcox <willy@infradead.org> wrote:
>
> Can we review this series properly?

Well, since it pretty much is what I suggested a few years ago, I
certainly won't NAK it.

And the patches looked very straightforward to me. Just the final
diffstat is worth quoting again because that certainly doesn't look
problematic:

  7 files changed, 33 insertions(+), 204 deletions(-)

and it removes that GIFT flag that was truly disgusting.

So I'm certainly ok with it from a "looking at the patch" standpoint.
I didn't _test_ it. I don't have any workload that might remotely
care.

I did a quick scan on debian code search for vmsplice, and after ten
pages of entries that weren't actually *using* it but had lists of
system calls, I grew bored. So there are likely users, but I don't
know what they are and how much they care. It *might* be a big
performance issue somewhere. Unlikely, but...

         Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by David Howells 4 days, 12 hours ago

Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Well, since it pretty much is what I suggested a few years ago, I
> certainly won't NAK it.

I've been wanting to get rid of vmsplice for a while, so I'm in favour of this
too.

David

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Christian Brauner 6 days, 15 hours ago

On Mon, Jun 01, 2026 at 08:50:00AM -0700, Linus Torvalds wrote:
> On Mon, 1 Jun 2026 at 08:36, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Can we review this series properly?
> 
> Well, since it pretty much is what I suggested a few years ago, I
> certainly won't NAK it.
> 
> And the patches looked very straightforward to me. Just the final
> diffstat is worth quoting again because that certainly doesn't look
> problematic:
> 
>   7 files changed, 33 insertions(+), 204 deletions(-)
> 
> and it removes that GIFT flag that was truly disgusting.
> 
> So I'm certainly ok with it from a "looking at the patch" standpoint.
> I didn't _test_ it. I don't have any workload that might remotely
> care.
> 
> I did a quick scan on debian code search for vmsplice, and after ten
> pages of entries that weren't actually *using* it but had lists of
> system calls, I grew bored. So there are likely users, but I don't
> know what they are and how much they care. It *might* be a big
> performance issue somewhere. Unlikely, but...

As usual I would argue to accept it and revert in case we get actual
regression reports...

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 6 days, 15 hours ago

On Mon, 1 Jun 2026 at 09:17, Christian Brauner <brauner@kernel.org> wrote:
>
> As usual I would argue to accept it and revert in case we get actual
> regression reports...

Yes, likely the only way we'd ever find out ..

          Linus

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Askar Safin 4 days, 6 hours ago

Askar Safin <safinaskar@gmail.com>:
> For all these reasons I propose to make vmsplice a simple wrapper for
> preadv2/pwritev2.

This patchset is already in next, but I still kindly ask people to
carefully review it. I'm still a new contributor, and I can make mistakes.

For example, in vmsplice I do "CLASS(fd, f)(fd)" and then I pass
"fd" (i. e. integer) to "do_writev/do_readv". I don't know whether
this is okay to do so.

-- 
Askar Safin
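
The pattern in question, checking a descriptor and then handing the same
integer on so that it is looked up a second time, can be mimicked in
userspace. The sketch below is only an analogy: fstat() stands in for the
pipe check, dup2() stands in for another thread replacing the descriptor
between the two lookups, and /dev/null is an arbitrary non-pipe target.
All function names are made up for the illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

/* Returns 1 if fd currently refers to a pipe (FIFO), 0 otherwise. */
static int fd_is_pipe(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 && S_ISFIFO(st.st_mode);
}

/*
 * Hypothetical demo: check that a descriptor is a pipe, replace what
 * the descriptor number refers to, then write through the same number.
 * Returns 1 if the check passed but the write went to a non-pipe,
 * 0 otherwise, and -1 on a setup failure.
 */
int check_then_use_sees_different_file(void)
{
    int p[2];
    int other, fd, was_pipe, is_pipe_now;
    char msg[] = "hello";
    struct iovec iov = { .iov_base = msg, .iov_len = sizeof(msg) };
    ssize_t n;

    if (pipe(p) != 0)
        return -1;
    other = open("/dev/null", O_WRONLY);
    if (other < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    fd = p[1];
    was_pipe = fd_is_pipe(fd);      /* first lookup: the check */

    /* Stand-in for a racing thread: fd now refers to /dev/null. */
    if (dup2(other, fd) < 0) {
        close(p[0]);
        close(p[1]);
        close(other);
        return -1;
    }

    n = writev(fd, &iov, 1);        /* second lookup: the use */
    is_pipe_now = fd_is_pipe(fd);

    close(p[0]);
    close(fd);
    close(other);
    return was_pipe && !is_pipe_now && n == (ssize_t)sizeof(msg);
}
```

The write still goes through every normal check for the file it ends up
on, so the effect is the same as if the caller had issued a plain writev()
on that file.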

Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

Posted by Linus Torvalds 4 days, 5 hours ago

On Wed, 3 Jun 2026 at 17:46, Askar Safin <safinaskar@gmail.com> wrote:
>
> For example, in vmsplice I do "CLASS(fd, f)(fd)" and then I pass
> "fd" (i. e. integer) to "do_writev/do_readv". I don't know whether
> this is okay to do so.

Oh, good point.

It's ok in the sense that it will work, and it's not really going to
cause problems, but it does mean that the 'struct file' will be looked
up twice.

And *technically* it's a TOCTOU race, where the first time you look it
up - in the vmsplice() wrapper - it could be one file, and you make
decisions based on that. And then pass it off to do_writev(), and it
will look it up again, and now it might be a different file.

Does it *matter*? No. Even if the file changed, and is now something
else, it's just going to be a different file that the user does
writev() on. do_writev() will still do all the appropriate safety
checks etc, so it doesn't really change anything. It just means that
you could pass what you *think* is a pipe (because you did that

+       if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
+               return -EBADF;

and by the time do_writev() then looks up the fd again it might be
something else, and now the user used vmsplice() as a really odd way
to write to another non-pipe file instead. But the user could have
done that with a regular writev(), so it's just the user being silly -
not something that really confuses the kernel.

Completely harmless, in other words.

But it would probably be *cleaner* to pass in the 'struct file *'
pointer that you already looked up once instead, and use vfs_writev()
instead of do_writev().

And I do suspect that the wrapper system call should use the same

   SYSCALL_DEFINE4(vmsplice, int, fd, ..

that the original used. Because if somebody crazy had the high bits
set in 'fd', the old vmsplice() system call didn't care, but your new
emulation system call will actually see the high bits on a 64-bit
architecture.

Again - that doesn't actually *matter*, because "CLASS(fd)" takes an
"int fd" and those high bits will be masked out at use time both in
vmsplice() and in do_readv/writev().

So it won't affect any behavior, but it does look a bit odd in the conversion.

And I already answered Christian wrt the change in behavior: I think
RWF_NOWAIT should always be set on the writing side - because splice()
never waited after it filled a pipe - and instead that
SPLICE_F_NONBLOCK flag should be used before write to check for
whether we'll wait *before* doing the write like it used to do with

        ret = wait_for_space(pipe, flags);

in vmsplice_to_pipe().

(On the other side, vmsplice_from_pipe() used to do
pipe_clear_nowait(), but I think that becomes a non-issue with the
conversion to readv()).

And once you need wait_for_space(), that probably means that the new
vmsplice() wrapper simply needs to remain inside fs/splice.c, and we
just need to make vfs_readv/vfs_writev non-static.

              Linus
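
The ordering suggested above, decide whether to wait *before* the write
and never wait inside the write itself, can be sketched in userspace.
This is an analogy under stated assumptions, not the kernel code: poll()
plays the role of wait_for_space(), O_NONBLOCK on the descriptor plays the
role of always passing RWF_NOWAIT, and the "nonblock" argument plays the
role of SPLICE_F_NONBLOCK. The function names are made up for the
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper. The caller is expected to have set O_NONBLOCK
 * on fd, so the writev() below never sleeps. Any waiting happens in
 * poll(), and only when "nonblock" is zero.
 */
ssize_t write_after_waiting_for_space(int fd, const struct iovec *iov,
                                      int iovcnt, int nonblock)
{
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    int ready = poll(&pfd, 1, nonblock ? 0 : -1);

    if (ready < 0)
        return -1;
    if (ready == 0) {           /* no space and not allowed to wait */
        errno = EAGAIN;
        return -1;
    }
    return writev(fd, iov, iovcnt);
}

/* Adds O_NONBLOCK to fd; returns 0 on success, -1 on failure. */
static int set_nonblock(int fd)
{
    int fl = fcntl(fd, F_GETFL);

    return fl < 0 ? -1 : fcntl(fd, F_SETFL, fl | O_NONBLOCK);
}

/* Returns 1 if a full pipe is refused and a drained one accepted. */
int demo_wait_before_write(void)
{
    int p[2], ok = 0;
    char chunk[4096];
    char msg[] = "hi";
    struct iovec iov = { .iov_base = msg, .iov_len = sizeof(msg) };

    if (pipe(p) != 0)
        return -1;
    if (set_nonblock(p[0]) != 0 || set_nonblock(p[1]) != 0)
        goto out;

    /* Fill the pipe until a non-blocking write is refused. */
    memset(chunk, 'x', sizeof(chunk));
    while (write(p[1], chunk, sizeof(chunk)) > 0)
        ;

    /* Full pipe, nonblock set: refused before any write happens. */
    errno = 0;
    if (write_after_waiting_for_space(p[1], &iov, 1, 1) != -1 ||
        errno != EAGAIN)
        goto out;

    /* Drain the pipe completely so there is space again. */
    while (read(p[0], chunk, sizeof(chunk)) > 0)
        ;

    if (write_after_waiting_for_space(p[1], &iov, 1, 1) !=
        (ssize_t)sizeof(msg))
        goto out;
    ok = 1;
out:
    close(p[0]);
    close(p[1]);
    return ok;
}
```

Calling the helper with nonblock set to zero would instead sleep in poll()
until a reader frees up space, and only then attempt the write.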