fs/splice: allow for a way to block splice() with read-only files

[RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Pedro Falcato 1 week, 1 day ago

Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
and Fragnasia, splicing a read-only file is fundamentally unsafe.

As such, as a mitigation, add a way for users to block splice() for
files they cannot write to. This eliminates this whole class of exploits
that use splice()+confusion in pipe/net/etc code to gain write-access to
files they can only read.

Users can simply toggle fs.splice_needs_write=1 and suddenly splice() will
refuse perfectly legal splices() from files it can only read, but not write.

For vmsplice(), make do with the address_space attached to the folio. Care
is taken to make sure the operation isn't too slowed down with locks. The check
itself isn't entirely equivalent (the mapping's host can be the internal bdev
inode, etc, and not the one in /dev against which permissions are checked),
but doing it in a more correct way would require dropping from GUP-fast to
GUP, and that would be too slow.

Signed-off-by: Pedro Falcato <pfalcato@suse.de>
---

Hello,

sending this out as an RFC so I can get better opinions from VFS & security
folks upstream. I wrote this out as a way to harden against all the page
cache attacks we've seen lately, that bottom out to splice() from a file
they cannot write + confusion elsewhere on the net stack/pipes/etc.

This is _obviously_ not perfect and not complete. My first (unsent) version
straight up returned -EPERM on splice() for these files. This one attempts
to retain some compatibility by only blocking the page splicing operation,
but still issuing the operation with normal copies (kindly suggested by Jan).
vmsplice() is a complicated issue, because gup_fast does not allow us access
to the VMA's vm_file. I tried hacking around it but it's not perfect (e.g. you
cannot grab the mnt_idmap for the file, since we only have access to the
address_space + its host).
I'm also not a fan of having somewhat hairy MM code in the middle of
fs/splice.c but that's something we can simply hoist elsewhere as this gets
un-RFC'd. It's also missing the external-facing docs for the sysctl.

My big questions are:
1) Is this a viable way forward?
2) How hard is it to deploy for users? Could we possibly even default to this?

Lightly tested using [1], which should see (if the mitigation is enabled):

$ ./a.out
[   17.283456] splice: task a.out/275 attempted to splice a file it cannot write to.
This is disabled by the fs.splice_needs_write sysctl. If this is truly required
then disable the sysctl. Performance may be degraded.
splice = 32

[1] https://gist.github.com/heatd/277339eb25df57d1f750d7da757cf7ff

 fs/splice.c  | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/sysctls.c | 11 +++++++
 2 files changed, 93 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index 9d8f63e2fd1a..624e5cbd42eb 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -39,6 +39,8 @@
 
 #include "internal.h"
 
+int sysctl_splice_needs_write;
+
 /*
  * Splice doesn't support FMODE_NOWAIT. Since pipes may set this flag to
  * indicate they support non-blocking reads or writes, we must clear it
@@ -947,6 +949,36 @@ static void do_splice_eof(struct splice_desc *sd)
 		sd->splice_eof(sd);
 }
 
+static bool splice_may_read(struct file *in)
+{
+	if (READ_ONCE(sysctl_splice_needs_write)) {
+		/*
+		 * Disallow splice from files we cannot in any way write to.
+		 * This serves as a trivial mitigation for page cache splice
+		 * attacks, where attackers splice a file they cannot write to,
+		 * but can read (like /etc/passwd, or /bin/su) and, through a
+		 * complex set of logic, manage to write to these page cache
+		 * pages.
+		 * The set of checks is quite simple: if we can already write
+		 * to it, if we could write to it (by reopening), or if we
+		 * could chmod it (owner of the file), then any vulnerability
+		 * coming from this is futile, as you can already write to it
+		 * normally.
+		 */
+		if (!(in->f_mode & FMODE_WRITE) &&
+		    !inode_owner_or_capable(file_mnt_idmap(in), file_inode(in)) &&
+		    file_permission(in, MAY_WRITE)) {
+			pr_warn_once("splice: task %s/%d attempted to splice"
+			" a file it cannot write to. This is disabled by the"
+			" fs.splice_needs_write sysctl. If this is truly required"
+			" then disable the sysctl. Performance may be degraded.\n",
+			current->comm, current->pid);
+			return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Callers already called rw_verify_area() on the entire range.
  * No need to call it for sub ranges.
@@ -971,11 +1003,13 @@ static ssize_t do_splice_read(struct file *in, loff_t *ppos,
 
 	if (unlikely(!in->f_op->splice_read))
 		return warn_unsupported(in, "read");
+
 	/*
 	 * O_DIRECT and DAX don't deal with the pagecache, so we allocate a
 	 * buffer, copy into it and splice that into the pipe.
 	 */
-	if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host))
+	if ((in->f_flags & O_DIRECT) || IS_DAX(in->f_mapping->host) ||
+	    !splice_may_read(in))
 		return copy_splice_read(in, ppos, pipe, len, flags);
 	return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }
@@ -1440,10 +1474,51 @@ static ssize_t __do_splice(struct file *in, loff_t __user *off_in,
 	return ret;
 }
 
+static bool may_write_to_page(struct page *page, struct address_space **plast)
+{
+	struct folio *folio = page_folio(page);
+	struct address_space *mapping, *last = *plast;
+	struct inode *inode;
+	bool may = false;
+
+	if (!READ_ONCE(sysctl_splice_needs_write))
+		return true;
+	/*
+	 * Always fine to write to anon folios.
+	 */
+	if (folio_test_anon(folio))
+		return true;
+
+	mapping = READ_ONCE(folio->mapping);
+	WARN_ON((unsigned long) mapping & FOLIO_MAPPING_FLAGS);
+
+	/* If it is the same (locklessly), then LGTM, proceed. */
+	if (mapping == last)
+		return true;
+	/*
+	 * Else we have to recheck with the folio lock held, for mapping
+	 * stability. TODO: killable?
+	 */
+	folio_lock(folio);
+	mapping = folio_mapping(folio);
+	/* May have been truncated, etc */
+	if (!mapping)
+		goto out_lock;
+	inode = mapping->host;
+	may = inode_owner_or_capable(&nop_mnt_idmap, inode) ||
+	      inode_permission(&nop_mnt_idmap, inode, MAY_WRITE) == 0;
+	if (likely(may))
+		*plast = mapping;
+out_lock:
+	folio_unlock(folio);
+	return may;
+}
+
 static ssize_t iter_to_pipe(struct iov_iter *from,
 			    struct pipe_inode_info *pipe,
 			    unsigned int flags)
 {
+	struct address_space *last = NULL;
 	struct pipe_buffer buf = {
 		.ops = &user_page_pipe_buf_ops,
 		.flags = flags
@@ -1467,6 +1542,12 @@ static ssize_t iter_to_pipe(struct iov_iter *from,
 		for (i = 0; i < n; i++) {
 			int size = umin(left, PAGE_SIZE - start);
 
+			if (!may_write_to_page(pages[i], &last)) {
+				iov_iter_revert(from, left);
+				while (i < n)
+					put_page(pages[i++]);
+				goto out;
+			}
 			buf.page = pages[i];
 			buf.offset = start;
 			buf.len = size;
diff --git a/fs/sysctls.c b/fs/sysctls.c
index ad429dffeb4b..1fcd7c9f92d5 100644
--- a/fs/sysctls.c
+++ b/fs/sysctls.c
@@ -7,6 +7,8 @@
 #include <linux/init.h>
 #include <linux/sysctl.h>
 
+extern int sysctl_splice_needs_write;
+
 static const struct ctl_table fs_shared_sysctls[] = {
 	{
 		.procname	= "overflowuid",
@@ -26,6 +28,15 @@ static const struct ctl_table fs_shared_sysctls[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_MAXOLDUID,
 	},
+	{
+		.procname	= "splice_needs_write",
+		.data		= &sysctl_splice_needs_write,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 };
 
 static int __init init_fs_sysctls(void)
-- 
2.54.0
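
The three-way check described in the splice_may_read() comment can be modelled in userspace as a pure predicate. This is only an illustrative sketch: struct splice_src, its field names and model_splice_may_read() are hypothetical and stand in for the kernel-side results of the FMODE_WRITE test, inode_owner_or_capable() and file_permission(in, MAY_WRITE).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the decision in splice_may_read().
 * The struct and its field names are illustrative, not kernel API.
 */
struct splice_src {
	bool opened_for_write;	/* in->f_mode & FMODE_WRITE */
	bool owner_or_capable;	/* inode_owner_or_capable() returned true */
	bool write_permitted;	/* file_permission(in, MAY_WRITE) returned 0 */
};

/* Returns true when page splicing is kept, false when the copy path is used. */
static bool model_splice_may_read(bool sysctl_on, const struct splice_src *s)
{
	/* With the sysctl off, nothing changes. */
	if (!sysctl_on)
		return true;
	/* Any one way of already being able to write is enough. */
	return s->opened_for_write || s->owner_or_capable || s->write_permitted;
}
```

In the patch, a false result does not fail the syscall: do_splice_read() routes the request through copy_splice_read() instead of the file's ->splice_read().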

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Christian Brauner 6 days, 18 hours ago

On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
> and Fragnasia, splicing a read-only file is fundamentally unsafe.
> 
> As such, as a mitigation, add a way for users to block splice() for
> files they cannot write to. This eliminates this whole class of exploits
> that use splice()+confusion in pipe/net/etc code to gain write-access to
> files they can only read.
> 
> Users can simply toggle fs.splice_needs_write=1 and suddenly splice() will
> refuse perfectly legal splices() from files it can only read, but not write.
> 
> For vmsplice(), make do with the address_space attached to the folio. Care
> is taken to make sure the operation isn't too slowed down with locks. The check
> itself isn't entirely equivalent (the mapping's host can be the internal bdev
> inode, etc, and not the one in /dev against which permissions are checked),
> but doing it in a more correct way would require dropping from GUP-fast to
> GUP, and that would be too slow.
> 
> Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> ---
> 
> Hello,
> 
> sending this out as an RFC so I can get better opinions from VFS & security
> folks upstream. I wrote this out as a way to harden against all the page
> cache attacks we've seen lately, that bottom out to splice() from a file
> they cannot write + confusion elsewhere on the net stack/pipes/etc.
> 
> This is _obviously_ not perfect and not complete. My first (unsent) version
> straight up returned -EPERM on splice() for these files. This one attempts
> to retain some compatibility by only blocking the page splicing operation,
> but still issuing the operation with normal copies (kindly suggested by Jan).
> vmsplice() is a complicated issue, because gup_fast does not allow us access
> to the VMA's vm_file. I tried hacking around it but it's not perfect (e.g you
> cannot grab the mnt_idmap for the file, since we only have access to the
> address_space + its host).
> I'm also not a fan of having somewhat hairy MM code in the middle of
> fs/splice.c but that's something we can simply hoist elsewhere as this gets
> un-RFC'd. It's also missing the external-facing docs for the sysctl.
> 
> My big questions are:
> 1) Is this a viable way forward?

I think that splice() and vmsplice() are pretty wonky APIs. Even ignoring
their recent prominent role in page cache attacks, they suffer from weird
issues due to their interactions with pipe_lock().

Bug with splice to a pipe preventing a process exit
20250122020850.2175427-1-kolyshkin@gmail.com
Sendfile holding pipe->mutex blocks the peer's pipe_release() from do_exit().

Change in splice() behaviour after 5.10? (LTP splice07)
7F3B484F-9555-486A-B19A-5A8EB6442988@kernel.org

[PATCH v2 00/11] Avoid unprivileged splice(file->)/(->socket) pipe exclusion
cover.1703126594.git.nabijaczleweli@nabijaczleweli.xyz
Pending splice from tty/socket/FIFO holds pipe->mutex indefinitely, blocking all other FIFO ops incl. read(O_NONBLOCK)

splice: prevent deadlock when splicing a file to itself
20260320130615.1109449-1-kartikey406@gmail.com
do_splice_direct_actor() still lacks file_inode(in) == file_inode(out) guard

AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
2135907.1747061490@warthog.procyon.org.uk
vmsplice/splice into AF_UNIX/pipe doesn't FOLL_PIN the source memory

My main gripe with the patch as written is that I find it really hard to
figure out who would deploy this. It half-cripples splice() and
vmsplice() for some use-cases but leaves it intact for others.

At that point you can also just ENOSYS splice() and vmsplice() via
seccomp and force a fallback on non-splice codepaths that userspace has
to have anyway as splice() isn't supported unconditionally.
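
A minimal sketch of what such a seccomp filter could look like, assuming a classic-BPF filter installed by the process itself. The helper name deny_splice_with_enosys() is made up for illustration. The sketch deliberately skips the seccomp_data.arch check that a production filter should perform, and it does not touch sendfile().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: make splice() and vmsplice() fail with ENOSYS for
 * the calling thread and its future children. Simplification: no check of
 * seccomp_data.arch, which a real filter should do first.
 */
static int deny_splice_with_enosys(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* splice: skip the next test and land on the ERRNO return. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_splice, 1, 0),
		/* vmsplice: fall through to ERRNO, otherwise skip to ALLOW. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_vmsplice, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	/* Needed so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Once the filter is installed, callers see ENOSYS from both syscalls and have to take whatever non-splice fallback they already carry.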

It feels like a knee-jerk reaction to an exploit class originating in
buggy modules that we have little control over and we would extend an
API to users that is really difficult to use.

What might make more sense is to add a splice specific security_*() hook
into the code so that an LSM can deny usage of splice in whatever way it
wants to - bpf lsm or in-tree lsm.

Then we don't have to have all this gunk in the VFS layer that will be
annoying to maintain with little value in the long-term. So I'm not very
likely to pick this up as is.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Jann Horn 6 days, 11 hours ago

On Mon, May 18, 2026 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote:
> On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> > Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
> > and Fragnasia, splicing a read-only file is fundamentally unsafe.
> >
> > As such, as a mitigation, add a way for users to block splice() for
> > files they cannot write to. This eliminates this whole class of exploits
> > that use splice()+confusion in pipe/net/etc code to gain write-access to
> > files they can only read.
> >
> > Users can simply toggle fs.splice_needs_write=1 and suddenly splice() will
> > refuse perfectly legal splices() from files it can only read, but not write.
[...]
> At that point you can also just ENOSYS splice() and vmsplice() via
> seccomp and force a fallback on non-splice codepaths that userspace has
> to have anyway as splice() isn't supported unconditionally.
>
> It feels like a knee-jerk reaction to an exploit class originating in
> buggy modules that we have little control over and we would extend an
> API to users that is really difficult to use.
>
> What might make more sense is to add a splice specific security_*() hook
> into the code so that an LSM can deny usage of splice in whatever way it
> wants to - bpf lsm or in-tree lsm.

I feel like a sysctl for "disable all the splice-like interfaces and
zerocopy TX" would be reasonable to have? Either by blocking such
operations, or better, silently downgrading all such operations to
normal copies.

FWIW, vmsplice() and splice() are also weird in how much memory they
can implicitly pin - if you call vmsplice() on a single byte in a 2M
THP page, I believe you'll implicitly pin 2M of memory...

By the way, another bug a few years ago of a similar shape was this
one - also a bug in networking code that can lead to accidental writes
into spliced pages:
https://project-zero.issues.chromium.org/issues/42451650 ("ktls writes
into spliced readonly pages")
with fix: https://git.kernel.org/linus/c5a595000e26

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Christian Brauner 5 days, 20 hours ago

On 2026-05-18 20:59 +0200, Jann Horn wrote:
> On Mon, May 18, 2026 at 2:30 PM Christian Brauner <brauner@kernel.org> wrote:
> > On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> > > Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
> > > and Fragnasia, splicing a read-only file is fundamentally unsafe.
> > >
> > > As such, as a mitigation, add a way for users to block splice() for
> > > files they cannot write to. This eliminates this whole class of exploits
> > > that use splice()+confusion in pipe/net/etc code to gain write-access to
> > > files they can only read.
> > >
> > > Users can simply toggle fs.splice_needs_write=1 and suddenly splice() will
> > > refuse perfectly legal splices() from files it can only read, but not write.
> [...]
> > At that point you can also just ENOSYS splice() and vmsplice() via
> > seccomp and force a fallback on non-splice codepaths that userspace has
> > to have anyway as splice() isn't supported unconditionally.
> >
> > It feels like a knee-jerk reaction to an exploit class originating in
> > buggy modules that we have little control over and we would extend an
> > API to users that is really difficult to use.
> >
> > What might make more sense is to add a splice specific security_*() hook
> > into the code so that an LSM can deny usage of splice in whatever way it
> > wants to - bpf lsm or in-tree lsm.
> 
> I feel like a sysctl for "disable all the splice-like interfaces and
> zerocopy TX" would be reasonable to have? Either by blocking such
> operations, or better, silently downgrading all such operations to
> normal copies.

I think blocking isn't going to be useful as it will make it harder for
distros to turn this on. So we should degrade.

> FWIW, vmsplice() and splice() are also weird in how much memory they
> can implicitly pin - if you call vmsplice() on a single byte in a 2M
> THP page, I believe you'll implicitly pin 2M of memory...

You don't have to convince me that it's a problematic api.

Let's discuss the other aggressive alternative: can we try to
unconditionally degrade to copy? This would affect sendfile(), splice(),
and vmsplice(). Worst case, we would have to introduce the sysctl
retroactively.

Thoughts?

> By the way, another bug a few years ago of a similar shape was this
> one - also a bug in networking code that can lead to accidental writes
> into spliced pages:
> https://project-zero.issues.chromium.org/issues/42451650 ("ktls writes
> into spliced readonly pages")
> with fix: https://git.kernel.org/linus/c5a595000e26

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Askar Safin 1 day, 10 hours ago

Christian Brauner <brauner@kernel.org>:
> Let's discuss the other aggressive alternative: Can we try and
> unconditionally degrade to copy. This would affect sendfile(), splice(),
> and vmsplice(). Worst-case we would have to introduce the sysctl
> retroactively.
> 
> Thoughts?

I think as a first step we should make vmsplice unconditionally equivalent
to readv/writev.

vmsplice was already problematic from a security point of view a long time
ago: see CVE-2020-29374 (https://lwn.net/Articles/849638/).

David Howells also doesn't like vmsplice:
https://lore.kernel.org/all/1763225.1769180226@warthog.procyon.org.uk/

Linus said in 2023:
> So I'd personally be perfectly ok with just making vmsplice() be
> exactly the same as write, and turn all of vmsplice() into just "it's
> a read() if the pipe is open for read, and a write if it's open for
> writing".
https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
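
As a rough userspace illustration of that idea (not of how the kernel would implement it), a copy-only vmsplice() amounts to picking readv() or writev() based on how the pipe end was opened. The helper name vmsplice_as_copy() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: what a copy-only vmsplice() would amount to. */
static ssize_t vmsplice_as_copy(int pipefd, const struct iovec *iov, int iovcnt)
{
	struct stat st;
	int fl;

	/* Like vmsplice(), only accept pipes. */
	if (fstat(pipefd, &st) < 0)
		return -1;
	if (!S_ISFIFO(st.st_mode)) {
		errno = EBADF;
		return -1;
	}

	fl = fcntl(pipefd, F_GETFL);
	if (fl < 0)
		return -1;

	/* Read end: plain copy out of the pipe. Write end: plain copy in. */
	if ((fl & O_ACCMODE) == O_RDONLY)
		return readv(pipefd, iov, iovcnt);
	return writev(pipefd, iov, iovcnt);
}
```

Because the data is copied, the caller may reuse its buffers as soon as the call returns, which is exactly the guarantee that page-gifting vmsplice() does not give.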

Even experts get vmsplice wrong, as can be seen in this thread:
https://lore.kernel.org/all/CAAUqJDvFuvms55Td1c=XKv6epfRnnP78438nZQ-JKyuCptGBiQ@mail.gmail.com/T/#u
As you can see in that thread, it is very hard to understand what the vmsplice
man page is supposed to mean. You can also see that vmsplice is very
fragile.

-- 
Askar Safin

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Mateusz Guzik 5 days, 19 hours ago

On Tue, May 19, 2026 at 11:49 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On 2026-05-18 20:59 +0200, Jann Horn wrote:
> > I feel like a sysctl for "disable all the splice-like interfaces and
> > zerocopy TX" would be reasonable to have? Either by blocking such
> > operations, or better, silently downgrading all such operations to
> > normal copies.
>
[..]
> I think blocking isn't going to be useful as it will make it harder for
> distros to turn this on. So we should degrade.
>
[..]
> Let's discuss the other aggressive alternative: Can we try and
> unconditionally degrade to copy. This would affect sendfile(), splice(),
> and vmsplice(). Worst-case we would have to introduce the sysctl
> retroactively.
>

I know at least nginx uses sendfile, but I never benchmarked how much it buys.

The original patch as proposed filters by rw perms on the file, which
I expect to exclude nginx.

While kernel-internal copy is still going to beat a userspace-based
read/write loop, this is still going to be a hit and I expect people
are going to complain. Afterwards you may end up with tutorials on how to
re-enable pre-patch behavior, partially defeating the point.
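
For reference, the two userspace paths being compared look roughly like the sketch below: sendfile() first, with a read()/write() loop as the fallback. The helper names copy_loop() and copy_fd() and the 64 KiB buffer size are illustrative choices, not taken from nginx or any other real program.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() loop: two user/kernel copies per chunk. */
static ssize_t copy_loop(int out, int in, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(in, buf, want);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break;	/* EOF */
		/* write() may be short, so drain the chunk fully. */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out, buf + off, n - off);

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		done += n;
	}
	return done;
}

/* Try sendfile(); fall back to the loop if the kernel refuses it. */
static ssize_t copy_fd(int out, int in, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = sendfile(out, in, NULL, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			if (errno == EINVAL || errno == ENOSYS) {
				/* in's offset already covers what was sent. */
				ssize_t r = copy_loop(out, in, len - done);

				return r < 0 ? -1 : (ssize_t)(done + r);
			}
			return -1;
		}
		if (n == 0)
			break;	/* EOF */
		done += n;
	}
	return done;
}
```

A kernel-side degrade to copy would keep copy_fd() on the sendfile() branch, so the caller still avoids bouncing the data through a userspace buffer.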

How about keeping denial of splice usage or degradation to copy on
the table, but basing it on a different criterion: whether the code involved
is "known good", for lack of a better description? IOW the kernel would
maintain a whitelist of "safe" cases. Random-ass AF_NOBODYEVERHEARDOF
does not make the cut.

Common-case usage would have to be audited of course, but this sounds
rather actionable and would provide hardening without much friction.

I can't stress enough that mucking around splice (even if worthwhile)
is merely addressing the currently popular attack vector and not the
general problem.

The general problem is that the kernel is expected to be able to run
with untrusted unprivileged users, while it avoidably exposes a huge
attack surface. Of course there is no way around providing a bunch of
syscalls to users, so *some* danger will always be there and one has
to expect that even core code has bugs which will be discovered by
LLMs in the coming months. Even then, there is tons of code which is
currently being audited by third parties and which has no use in most
setups. Instead it gets autoloaded in response to an exploit wishing
to take advantage of its bugs.

The huge attack surface was always a problematic position to be in,
but with the advent of LLMs any unskilled person can drop a 0day and
the position is straight up untenable. In the long run there is no way
around blocking access to code by default, way beyond the current
splice proposal.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Jann Horn 5 days, 14 hours ago

On Tue, May 19, 2026 at 12:51 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> I can't stress enough that mucking around splice (even if worthwhile)
> is merely addressing the currently popular attack vector and not the
> general problem.
[...]
> The huge attack surface was always a problematic position to be in,
> but with the advent of LLMs any unskilled person can drop a 0day and
> the position is straight up untenable. In the long run there is no way
> around blocking access to code by default, way beyond the current
> splice proposal.

Sure - but the path to that is putting restrictions on the
availability of individual kernel features, and this proposal is one
step toward that.

These splice/vmsplice functions have been part of security bugs in the
past, and have contributed to making other security bugs easier to
exploit too. I think it's a sufficiently big problem area to warrant a
disable toggle, especially since this is more or less just a
performance optimization that we should be able to nerf without
outright breaking anything.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by James Bottomley 5 days, 17 hours ago

On Tue, 2026-05-19 at 12:51 +0200, Mateusz Guzik wrote:
[...]
> I can't stress enough that mucking around splice (even if worthwhile)
> is merely addressing the currently popular attack vector and not the
> general problem.
> 
> The general problem is that the kernel is expected to be able to run
> with untrusted unprivileged users, while it avoidably exposes a huge
> attack surface. Of course there is no way around providing a bunch of
> syscalls to users, so *some* danger will always be there and one has
> to expect that even core code has bugs which will be discovered by
> LLMs in the coming months. Even then, there is tons of code which is
> currently being audited by third parties and which has no use in most
> setups. Instead it gets autoloaded in response to an exploit wishing
> to take advantage of its bugs.
> 
> The huge attack surface was always a problematic position to be in,
> but with the advent of LLMs any unskilled person can drop a 0day and
> the position is straight up untenable. In the long run there is no
> way around blocking access to code by default, way beyond the current
> splice proposal.

Attack surface is a great measure for making lower bound security
assertions and proofs.  I mean I've used it myself to build a container
that was provably more secure than a VM:

https://blog.hansenpartnership.com/measuring-the-horizontal-attack-profile-of-nabla-containers/

But the point is it is a lower bound: security is always better than
the attack surface measure says.  The problem with the measure for
something like a kernel is that the kernel's job is to provide services
to untrusted users, so amazingly enough a plurality of its code goes to
this function making, as you say, the attack surface huge.  I'm sure
there are low hanging little used interfaces we could remove to lower
the attack surface, and perhaps we could voluntarily wall off large
areas for "secure" users.  However, security has always been a tradeoff
for usability (the most secure PC is one that's powered off) so the
more you wall off the smaller the pool of actual users becomes.

I think we should be spending our time on better interface design.  As
Kees' security project proves: classes of bug can be eliminated via
various techniques (effectively rendering extant defects unexploitable)
and we can certainly do a better job of error legs, which seems to be
where AI is turning up the majority of the issues.

The problem with the above is that while it's easy to measure bug
density and correlate it directly to attack surface as lower bound
mathematics, it's really hard to measure the reductions in
exploitability potential that better API, coding and other security
techniques give us.

I really believe we've made significant improvements to reduce our
exploitability, but I can't (yet) measure it.  That makes it easy to
claim the sky is falling due to the size of our attack surface, but
doing so effectively ignores every exploitability improvement we've
made over the years (and thus ignores the hard work and dedication of a
large group of individuals).

Regards,

James

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Christian Brauner 5 days, 19 hours ago

On 2026-05-19 12:51 +0200, Mateusz Guzik wrote:
> On Tue, May 19, 2026 at 11:49 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On 2026-05-18 20:59 +0200, Jann Horn wrote:
> > > I feel like a sysctl for "disable all the splice-like interfaces and
> > > zerocopy TX" would be reasonable to have? Either by blocking such
> > > operations, or better, silently downgrading all such operations to
> > > normal copies.
> >
> [..]
> > I think blocking isn't going to be useful as it will make it harder for
> > distros to turn this on. So we should degrade.
> >
> [..]
> > Let's discuss the other aggressive alternative: Can we try and
> > unconditionally degrade to copy. This would affect sendfile(), splice(),
> > and vmsplice(). Worst-case we would have to introduce the sysctl
> > retroactively.
> >
> 
> I know at least nginx uses sendfile, but I never benchmarked how much it buys.
> 
> The original patch as proposed filters by rw perms on the file, which
> I expect to exclude nginx.
> 
> While kernel-internal copy is still going to beat a userspace-based
> read/write loop, this is still going to be a hit and I expect people
> are going to complain. Afterwards you may end up with tutorials how to
> re-enable pre-patch behavior, partially defeating the point.
> 
> How about denial of splice usage or degradation to copy are still on
> the table, but based on a different criterion: whether code involved
> is "known good" for lack of a better description. iow the kernel would
> maintain a whitelist of "safe" cases. Random-ass AF_NOBODYEVERHEARDOF
> does not make the cut.

I had thought about that too but I felt a bit iffy about it. You could
envision an FOP_* flag for this:

  /* Module may use splice-like apis */
  #define FOP_MAY_SPLICE          ((__force fop_flags_t)(1 << 8))

But that doesn't address how fundamentally broken vmsplice() for example
really is and that probably no one should get to use it in its current
form.

> Common-case usage would have to be audited of course, but this sounds
> rather actionable and would provide hardening without much friction.

And that's the usual problem where rando module will just raise the
flag. Maybe that's fine and we will keep up.

> I can't stress enough that mucking around splice (even if worthwhile)
> is merely addressing the currently popular attack vector and not the
> general problem.
> 
> The general problem is that the kernel is expected to be able to run
> with untrusted unprivileged users, while it avoidably exposes a huge
> attack surface. Of course there is no way around providing a bunch of
> syscalls to users, so *some* danger will always be there and one has
> to expect that even core code has bugs which will be discovered by
> LLMs in the coming months. Even then, there is tons of code which is
> currently being audited by third parties and which has no use in most
> setups. Instead it gets autoloaded in response to an exploit wishing
> to take advantage of its bugs.
> 
> The huge attack surface was always a problematic position to be in,
> but with the advent of LLMs any unskilled person can drop a 0day and
> the position is straight up untenable. In the long run there is no way
> around blocking access to code by default, way beyond the current
> splice proposal.

I see this is the "let's become goat-farmers" portion of the message.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Mateusz Guzik 5 days, 18 hours ago

On Tue, May 19, 2026 at 12:59 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On 2026-05-19 12:51 +0200, Mateusz Guzik wrote:
> > How about denial of splice usage or degradation to copy are still on
> > the table, but based on a different criterion: whether code involved
> > is "known good" for lack of a better description. iow the kernel would
> > maintain a whitelist of "safe" cases. Random-ass AF_NOBODYEVERHEARDOF
> > does not make the cut.
>
> I had thought about that too but I felt a bit iffy about it. You could
> envision an FOP_* flag for this:
>
>   /* Module may use splice-like apis */
>   #define FOP_MAY_SPLICE          ((__force fop_flags_t)(1 << 8))
>
> But that doesn't address how fundamentally broken vmsplice() for example
> really is and that probably no one should get to use it in its current
> form.

I never looked into the area on Linux; I am willing to take claims of
breakage at face value.

In this context I'm saying the functionality is used in the real world
for performance reasons and just whacking it imo does not cut it.

>
> > Common-case usage would have to be audited of course, but this sounds
> > rather actionable and would provide hardening without much friction.
>
> And that's the usual problem where rando module will just raise the
> flag. Maybe that's fine and we will keep up.
>

If this is a genuine worry the whitelist can still be introduced and
managed by one person (Linus?) very easily. The implementation is only
mildly cumbersome to get going and trivial to spread afterwards.

You could have something like this:
struct file_operations funky_ops = {
....
};
VFS_FILE_OPS_REGISTER(funky_ops);

This would call into a routine which checks if the ops at hand are
allowed splice support, so sneaking in the flag won't work.

The vfs layer would BUG out if unregistered ops got used on a file.
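
A toy userspace model of that registration idea, under heavy assumptions: struct model_fops, the name-based allowlist and both helpers are invented for illustration, and the thread does not say how ops would actually be identified. The point is only that the splice permission is decided at registration time against a centrally maintained list, not by the flag a module sets on itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct file_operations. */
struct model_fops {
	const char *name;
	bool may_splice;	/* what the module author asked for */
	bool registered;	/* set by registration only */
};

/* Centrally maintained allowlist; the entries are made up. */
static const char *const splice_allowlist[] = { "ext4_file", "pipe" };

/* Model of what VFS_FILE_OPS_REGISTER() might call into. */
static void model_fops_register(struct model_fops *ops)
{
	size_t n = sizeof(splice_allowlist) / sizeof(splice_allowlist[0]);
	bool allowed = false;

	for (size_t i = 0; i < n; i++)
		if (strcmp(ops->name, splice_allowlist[i]) == 0)
			allowed = true;

	/* A flag set by ops that are not on the list is dropped. */
	ops->may_splice = ops->may_splice && allowed;
	ops->registered = true;
}

/* Where the proposal would BUG(), the model just reports "no". */
static bool model_fops_can_splice(const struct model_fops *ops)
{
	return ops->registered && ops->may_splice;
}
```

With this shape, funky_ops can set may_splice as much as it likes; unless its name is on the list, registration clears the bit.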

It would take explicitly malicious action to bypass the mechanism, but
if someone is doing that, they are presumably in position to introduce
vulns in another way.

> > The huge attack surface was always a problematic position to be in,
> > but with the advent of LLMs any unskilled person can drop a 0day and
> > the position is straight up untenable. In the long run there is no way
> > around blocking access to code by default, way beyond the current
> > splice proposal.
>
> I see this is the "let's become goat-farmers" portion of the message.

I think this is very much salvageable, but action will need to be
taken by a real team (in turn meaning a buy in from a real org). While
one can boast about some mitigations, nothing beats not allowing given
code to run in the first place.

Poor man's initial solution would be to check lsmod and block
everything not currently loaded.

Longer term *something* will need to be implemented to fine-grain
access to code, including stuff shipping in core kernel. There are
some ingredients to do it already (seccomp and the LSM framework,
maybe even some of the currently available modules).

With that and money spent on auditing the stuff everyone has to use
anyway (vm, vfs etc.) overall stability from security standpoint would
be restored -- a 0day dropped against mod_nobody_ever_heard_of will
stop being an emergency.

While I'm just handwaving here and won't be taking part in any of it,
restoring stability from security POV is very much doable as far as
engineering goes.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Christian Brauner 2 days, 17 hours ago

On Tue, May 19, 2026 at 01:56:42PM +0200, Mateusz Guzik wrote:
> On Tue, May 19, 2026 at 12:59 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On 2026-05-19 12:51 +0200, Mateusz Guzik wrote:
> > > How about denial of splice usage or degradation to copy are still on
> > > the table, but based on a different criterion: whether code involved
> > > is "known good" for lack of a better description. iow the kernel would
> > > maintain a whitelist of "safe" cases. Random-ass AF_NOBODYEVERHEARDOF
> > > does not make the cut.
> >
> > I had thought about that to but I felt a bit iffy about it. You could
> > envision an FOP_* flag for this:
> >
> >   /* Module may use splice-like apis */
> >   #define FOP_MAY_SPLICE          ((__force fop_flags_t)(1 << 8))
> >
> > But that doesn't address how fundamentally broken vmsplice() for example
> > really is and that probably no one should get to use it in its current
> > form.
> 
> I never looked into the area on Linux, I am willing to take claims of
> breakage at face value.
> 
> In this context I'm saying the functionality is used in the real world
> for performance reasons and just whacking it imo does not cut it.
> 
> >
> > > Common-case usage would have to be audited of course, but this sounds
> > > rather actionable and would provide hardening without much friction.
> >
> > And that's the usual problem where rando module will just raise the
> > flag. Maybe that's fine and we will keep up.
> >
> 
> If this is a genuine worry the whitelist can still be introduced and
> managed by one person (Linus?) very easily. The implementation is only
> mildly cumbersome to get going and trivial to spread afterwards.
> 
> You could have something like this:
> struct file_operations funky_ops = {
> ....
> };
> VFS_FILE_OPS_REGISTER(funky_ops);

Unless someone is going to do that work, a sysctl to force-degrade all
the APIs into copies should be ok.
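
As a rough userspace picture of what "degrading into copies" means, the
sketch below moves bytes through a bounce buffer with read()/write()
instead of handing page references around, and only does so when splice()
itself is refused. The helper names copy_bytes() and splice_or_copy() are
hypothetical, and the choice of ENOSYS/EINVAL as fallback triggers is an
assumption of this sketch, not something the proposed sysctl defines.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain copy through a bounce buffer: no page references change hands. */
ssize_t copy_bytes(int in, int out, size_t len)
{
	char buf[4096];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t got = read(in, buf, want);

		if (got < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (got == 0)
			break;	/* end of input */

		/* write() may be short, so loop until the chunk is out. */
		for (ssize_t off = 0; off < got; ) {
			ssize_t put = write(out, buf + off, (size_t)(got - off));

			if (put < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += put;
		}
		done += (size_t)got;
	}
	return (ssize_t)done;
}

/* Try splice() first; fall back to copying if the call is refused. */
ssize_t splice_or_copy(int in, int out, size_t len)
{
	ssize_t n = splice(in, NULL, out, NULL, len, 0);

	if (n >= 0 || (errno != ENOSYS && errno != EINVAL))
		return n;
	return copy_bytes(in, out, len);
}
```

Note that the two paths are not identical from the caller's point of view:
splice() may return a short count, while copy_bytes() keeps going until
len bytes or end of input.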

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Christoph Hellwig 6 days ago

On Mon, May 18, 2026 at 08:59:13PM +0200, Jann Horn wrote:
> I feel like a sysctl for "disable all the splice-like interfaces and
> zerocopy TX" would be reasonable to have? Either by blocking such
> operations, or better, silently downgrading all such operations to
> normal copies.

Yes.

> FWIW, vmsplice() and splice() are also weird in how much memory they
> can implicitly pin - if you call vmsplice() on a single byte in a 2M
> THP page, I believe you'll implicitly pin 2M of memory...

vmsplice is the worst, as it is one of the few remaining places that
can incorrectly dirty file backed pages without telling the file system
and cause the other problems fixed by a FOLL_PIN conversion, but it is
the only one where we do not have any idea yet how we could convert it
to FOLL_PIN due to the unbounded pin time.

Note that we sometimes use splice underneath other operations that do
not have these issues.  The most important one is sendfile, which has
very clearly defined semantics that avoid all these pinning problems, but
there also are similar in-kernel users as in nfsd.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Pedro Falcato 6 days, 17 hours ago

On Mon, May 18, 2026 at 02:20:30PM +0200, Christian Brauner wrote:
> On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> > Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
> > and Fragnasia, splicing a read-only file is fundamentally unsafe.
> > 
> > As such, as a mitigation, add a way for users to block splice() for
> > files they cannot write to. This eliminates this whole class of exploits
> > that use splice()+confusion in pipe/net/etc code to gain write-access to
> > files they can only read.
> > 
> > Users can simply toggle fs.splice_needs_write=1 and suddenly splice() will
> > refuse perfectly legal splices() from files it can only read, but not write.
> > 
> > For vmsplice(), make due with the address_space attached to the folio. Care
> > is held to make sure the operation isn't too slowed down with locks. The check
> > itself isn't entirely equivalent (the mapping's host can be the internal bdev
> > inode, etc, and not the one in /dev against which permissions are checked),
> > but doing it in a more correct way would require dropping from GUP-fast to
> > GUP, and that would be too slow.
> > 
> > Signed-off-by: Pedro Falcato <pfalcato@suse.de>
> > ---
> > 
> > Hello,
> > 
> > sending this out as an RFC so I can get better opinions from VFS & security
> > folks upstream. I wrote this out as a way to harden against all the page
> > cache attacks we've seen lately, that bottom out to splice() from a file
> > they cannot write + confusion elsewhere on the net stack/pipes/etc.
> > 
> > This is _obviously_ not perfect and not complete. My first (unsent) version
> > straight up returned -EPERM on splice() for these files. This one attempts
> > to retain some compatibility by only blocking the page splicing operation,
> > but still issuing the operation with normal copies (kindly suggested by Jan).
> > vmsplice() is a complicated issue, because gup_fast does not allow us access
> > to the VMA's vm_file. I tried hacking around it but it's not perfect (e.g you
> > cannot grab the mnt_idmap for the file, since we only have access to the
> > address_space + its host).
> > I'm also not a fan of having somewhat hairy MM code in the middle of
> > fs/splice.c but that's something we can simply hoist elsewhere as this gets
> > un-RFC'd. It's also missing the external-facing docs for the sysctl.
> > 
> > My big questions are:
> > 1) Is this a viable way forward?
> 
> I think that splice and vmsplice() are pretty wonky apis. Ignoring it's
> recent prominent role in page cache attacks it suffers from weird issues
> due to its interactions with pipe_lock().
> 
> Bug with splice to a pipe preventing a process exit
> 20250122020850.2175427-1-kolyshkin@gmail.com
> Sendfile holding pipe->mutex blocks the peer's pipe_release() from do_exit().
> 
> Change in splice() behaviour after 5.10? (LTP splice07)
> 7F3B484F-9555-486A-B19A-5A8EB6442988@kernel.org
> 
> [PATCH v2 00/11] Avoid unprivileged splice(file->)/(->socket) pipe exclusion
> cover.1703126594.git.nabijaczleweli@nabijaczleweli.xyz
> Pending splice from tty/socket/FIFO holds pipe->mutex indefinitely, blocking all other FIFO ops incl. read(O_NONBLOCK)
> 
> splice: prevent deadlock when splicing a file to itself
> 20260320130615.1109449-1-kartikey406@gmail.com
> do_splice_direct_actor() still lacks file_inode(in) == file_inode(out) guard
> 
> AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
> 2135907.1747061490@warthog.procyon.org.uk
> vmsplice/splice into AF_UNIX/pipe doesn't FOLL_PIN the source memory
> 
> My main gripe with the patch as written is that I find it really hard to
> figure out who would deploy this. It half-cripples splice() and
> vmsplice() for some use-cases but leaves it intact for others.

Not just splice() and vmsplice(), but sendfile(), copy_file_range() too.
My bet (perhaps not informed enough) is that there simply aren't that many
users doing splice-like operations from files they do not own in some way.

(maybe not true for copy_file_range(), I admit)
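
To get a feel for which descriptors such a restriction would touch, here
is a rough userspace approximation of the "owner or may write" test. It is
a sketch under assumptions: fd_owned_or_writable() is a hypothetical name,
geteuid() stands in for the kernel's notion of the caller, capabilities
are only covered to the extent faccessat() honours them, and /proc must be
mounted. It is not the in-kernel check from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Approximates "caller owns the file or could write to it".
 * Returns false for descriptors that cannot be examined.
 */
bool fd_owned_or_writable(int fd)
{
	struct stat st;
	char path[64];
	int n;

	if (fstat(fd, &st) != 0)
		return false;

	/* Owner check, using the effective UID as an approximation. */
	if (st.st_uid == geteuid())
		return true;

	/* Write-permission check against the effective IDs. */
	n = snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	assert(n > 0 && (size_t)n < sizeof(path));
	return faccessat(AT_FDCWD, path, W_OK, AT_EACCESS) == 0;
}
```

Running this over the descriptors a workload actually splices from would
give a first estimate of how often the slower copy path would be taken.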

> 
> At that point you can also just ENOSYS splice() and vmsplice() via
> seccomp and force a fallback on non-splice codepaths that userspace has
> to have anyway as splice() isn't supported unconditionally.

IIRC GNU grep is one simple example where they assume splice() from a pipe
to /dev/null Just Works(tm) and it exits(1) otherwise.
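
For reference, the seccomp variant mentioned above can be sketched as a
small classic-BPF filter that makes splice() and vmsplice() fail with
ENOSYS while allowing everything else. The function names are
hypothetical, and the filter deliberately omits the architecture check
that a production filter should perform, so treat it as an illustration
rather than a hardened policy.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Installs the filter for the calling thread; returns 0 on success. */
int deny_splice_with_enosys(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
		/* splice: skip one instruction, landing on the ERRNO return. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_splice, 1, 0),
		/* vmsplice: fall through to ERRNO, otherwise skip to ALLOW. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_vmsplice, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
		.filter = filter,
	};

	/* Required so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* With the filter active, splice() on bad fds reports ENOSYS, not EBADF. */
bool splice_is_denied(void)
{
	errno = 0;
	return splice(-1, NULL, -1, NULL, 1, 0) == -1 && errno == ENOSYS;
}
```

Programs that treat a failing splice() as fatal, like the grep case
above, would still break under such a filter, which is the compatibility
concern being raised here.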

> It feels like a knee-jerk reaction to an exploit class originating in
> buggy modules that we have little control over and we would extend an
> API to users that is really difficult to use.
> 
> What might make more sense is to add a splice specific security_*() hook
> into the code so that an LSM can deny usage of splice in whatever way it
> wants to - bpf lsm or in-tree lsm.

I don't dislike that option, but I don't love leaving hardening to LSMs. The
kernel quite literally gets a new splice-related vulnerability every week now,
where userspace gets to pass pages it has no business passing to funky
codepaths that then write on these pages. I feel like natively restricting
what you can pass is simply a natural way forward.

> 
> Then we don't have to have all this gunk in the VFS layer that will be
> annoying to maintain with little value in the long-term. So I'm not very
> likely to pick this up as is.

Totally. That's what the RFC tag is for :)

-- 
Pedro

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Mateusz Guzik 1 week, 1 day ago

On Sat, May 16, 2026 at 8:21 PM Pedro Falcato <pfalcato@suse.de> wrote:
>
> Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
> and Fragnasia, splicing a read-only file is fundamentally unsafe.
>
> As such, as a mitigation, add a way for users to block splice() for
> files they cannot write to. This eliminates this whole class of exploits
> that use splice()+confusion in pipe/net/etc code to gain write-access to
> files they can only read.
>

The patch touches stuff I'm not familiar with, so no comments on that front.

The core idea is a half-measure which will at best buy a few weeks until
splice bugs dry out and there will be a new attack vector du jour
which people point their LLMs at. Perhaps it is worth it as a bandaid
until a more complete solution shows up.

A full-measure solution would git rm these modules to
fuck^W^W^W^W^W^Wprevent suspicious modules from being usable by
unprivileged users to begin with, at least by default. Of the cases I
had seen so far, your typical machine either does not have the thing
loaded or in the worst case does not have a legitimate reason to
expose them. Someone(tm) would have to figure out a sensible setup
where stuff stops autoloading and there is a whitelist of modules
allowed on the box, and even then the functionality provided is only
usable by select people.

Back in the day I wrote a LSM which learned what kind of socket
options etc. are being used in a given workload and afterwards only
allowed that. The code never escaped $workplace, but the idea might be
a starting point.

All that aside, is someone with access to many tokens(tm) racing the
bad guys vs splice bugs? Realistically one has to assume there are
some more waiting (as suggested by the patch) and chances are decent
nobody has to be caught with their pants down even without the
aforementioned mitigation.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Christian Brauner 6 days, 18 hours ago

On Sun, May 17, 2026 at 01:51:16AM +0200, Mateusz Guzik wrote:
> On Sat, May 16, 2026 at 8:21 PM Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
> > and Fragnasia, splicing a read-only file is fundamentally unsafe.
> >
> > As such, as a mitigation, add a way for users to block splice() for
> > files they cannot write to. This eliminates this whole class of exploits
> > that use splice()+confusion in pipe/net/etc code to gain write-access to
> > files they can only read.
> >
> 
> The patch touches stuff I'm not familiar with, so no comments on that front.
> 
> The core idea is a half-measure which will at best buy a few weeks until
> splice bugs dry out and there will be a new attack vector du jour
> which people point their LLMs at. Perhaps it is worth it as a bandaid
> until a more complete solution shows up.
> 
> A full-measure solution would git rm these modules to
> fuck^W^W^W^W^W^Wprevent suspicious modules from being usable by
> unprivileged users to begin with, at least by default. Of the cases I
> had seen so far, your typical machine either does not have the thing
> loaded or in the worst case does not have a legitimate reason to
> expose them. Someone(tm) would have to figure out a sensible setup
> where stuff stops autoloading and there is a whitelist of modules
> allowed on the box, and even then the functionality provided is only
> usable by select people.
> 
> Back in the day I wrote a LSM which learned what kind of socket
> options etc. are being used in a given workload and afterwards only
> allowed that. The code never escaped $workplace, but the idea might be
> a starting point.

Shouldn't be difficult with a bpf lsm tbh. Just so we don't get any
ideas about adding yet more lsms to the kernel...

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Pedro Falcato 1 week, 1 day ago

On Sun, May 17, 2026 at 01:51:16AM +0200, Mateusz Guzik wrote:
> On Sat, May 16, 2026 at 8:21 PM Pedro Falcato <pfalcato@suse.de> wrote:
> >
> > Since the advent of vulns like Dirty Pipe, Dirty Frag, Copy Fail
> > and Fragnasia, splicing a read-only file is fundamentally unsafe.
> >
> > As such, as a mitigation, add a way for users to block splice() for
> > files they cannot write to. This eliminates this whole class of exploits
> > that use splice()+confusion in pipe/net/etc code to gain write-access to
> > files they can only read.
> >
> 
> The patch touches stuff I'm not familiar with, so no comments on that front.
> 
> The core idea is a half-measure which will at best buy a few weeks until
> splice bugs dry out and there will be a new attack vector du jour
> which people point their LLMs at. Perhaps it is worth it as a bandaid
> until a more complete solution shows up.

I won't say you don't have a point. However, the whole idea is to attempt to
restrict the attack vector. This is a proven source of problems. It's been known
since 2022 - now we have 3 or 4 new vulnerabilities taking advantage of this.

Clearly, the kernel isn't doing a great job of passing these pages around internally.
The API isn't the easiest to use (much less safely). Can those be tackled? Yes.
Today? Definitely not. In a few years? Also doubtful.

Merging something like this (or equivalent, that allows for a restriction) I
think is simply common sense. Basic hardening of interfaces so if anything
gets discovered it's much less severe. Even if nothing else related to splice()
gets discovered, I think it's still useful to have secure by default, minimally
intrusive hardening in the kernel.

One example: do you really need to be able to send SUID binaries in a zero-copy
super efficient way around the kernel? Not really. But if something fscks it up,
you get the world's easiest LPE.

(not that today they couldn't find some random UAF and attempt to exploit this,
since all of the page cache is mapped, but there are concurrent efforts e.g ASI
that attempt to tackle this)

> 
> A full-measure solution would git rm these modules to
> fuck^W^W^W^W^W^Wprevent suspicious modules from being usable by
> unprivileged users to begin with, at least by default. Of the cases I

Sadly, for the networking case it's pretty important to have autoloading
for protocol modules. Also, for the module in question (for Fragnesia it was
ESP), that's really not that arcane. It's just a piece of code that gets
confused as to whether or not it owns (versus "shares", like it should) the
pages it's looking at.

With Fragnesia it's been found that there are _several_ important spots
in the networking stack that aren't propagating this shared flag
correctly.

> had seen so far, your typical machine either does not have the thing
> loaded or in the worst case does not have a legitimate reason to
> expose them. Someone(tm) would have to figure out a sensible setup
> where stuff stops autoloading and there is a whitelist of modules
> allowed on the box, and even then the functionality provided is only
> usable by select people.
> 
> Back in the day I wrote a LSM which learned what kind of socket
> options etc. are being used in a given workload and afterwards only
> allowed that. The code never escaped $workplace, but the idea might be
> a starting point.

It's beyond my skillset to comment whether or not that works well in
practice :)

> 
> All that aside, is someone with access to many tokens(tm) racing the
> bad guys vs splice bugs? Realistically one has to assume there are

I would assume it's hard to race the bad guys (the typical security problem).
You get a great zero-day as a blackhat and (AIUI) you might be looking at
tens-to-even-hundreds of thousands in rewards.

But, again, not really my field of expertise :v

> some more waiting (as suggested by the patch) and chances are decent
> nobody has to be caught with their pants down even without the
> aforementioned mitigation.

-- 
Pedro

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Matthew Wilcox 1 week, 1 day ago

On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> +static bool may_write_to_page(struct page *page, struct address_space **plast)
> +{
> +	struct folio *folio = page_folio(page);
> +	struct address_space *mapping, *last = *plast;
> +	struct inode *inode;
> +	bool may = false;
> +
> +	if (!READ_ONCE(sysctl_splice_needs_write))
> +		return true;
> +	/*
> +	 * Always fine to write to anon folios.
> +	 */
> +	if (folio_test_anon(folio))
> +		return true;

What about KSM?  It's not something we've seen attacked yet, but it'd be
pretty nasty to be able to change a KSM page in another process.

I just got off a flight, so hopefully I'm semicoherent.

> +	mapping = READ_ONCE(folio->mapping);
> +	WARN_ON((unsigned long) mapping & FOLIO_MAPPING_FLAGS);
> +
> +	/* If it is the same (locklessly), then LGTM, proceed. */
> +	if (mapping == last)
> +		return true;
> +	/*
> +	 * Else we have to recheck with the folio lock held, for mapping
> +	 * stability. TODO: killable?

I wouldn't've thought that'd be necessary.  The folio can't be being
read because it's mapped, and we won't map a folio until it's uptodate.

> +	 */
> +	folio_lock(folio);
> +	mapping = folio_mapping(folio);

I think you're safe to just look at folio->mapping here.  You have a
refcount on the folio so it can't be freed, and I'm not sure there's a
way to transition from page cache folio to anon folio without taking a
trip through the page allocator.

> +	/* May have been truncated, etc */
> +	if (!mapping)
> +		goto out_lock;

typically we call this "out_unlock".

> +	inode = mapping->host;
> +	may = inode_owner_or_capable(&nop_mnt_idmap, inode) ||
> +	      inode_permission(&nop_mnt_idmap, inode, MAY_WRITE) == 0;
> +	if (likely(may))
> +		*plast = mapping;
> +out_lock:
> +	folio_unlock(folio);
> +	return may;
> +}

I don't have a problem with the idea, other than it's really sad we have
to do this.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Pedro Falcato 1 week, 1 day ago

On Sun, May 17, 2026 at 12:07:05AM +0100, Matthew Wilcox wrote:
> On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> > +static bool may_write_to_page(struct page *page, struct address_space **plast)
> > +{
> > +	struct folio *folio = page_folio(page);
> > +	struct address_space *mapping, *last = *plast;
> > +	struct inode *inode;
> > +	bool may = false;
> > +
> > +	if (!READ_ONCE(sysctl_splice_needs_write))
> > +		return true;
> > +	/*
> > +	 * Always fine to write to anon folios.
> > +	 */
> > +	if (folio_test_anon(folio))
> > +		return true;
> 
> What about KSM?  It's not something we've seen attacked yet, but it'd be
> pretty nasty to be able to change a KSM page in another process.

It's my understanding that only anon pages can be KSM'd, and KSM still keeps
the FOLIO_MAPPING_ANON bit set. So folio_test_anon() should still test true
for those.

> 
> I just got off a flight, so hopefully I'm semicoherent.
> 
> > +	mapping = READ_ONCE(folio->mapping);
> > +	WARN_ON((unsigned long) mapping & FOLIO_MAPPING_FLAGS);
> > +
> > +	/* If it is the same (locklessly), then LGTM, proceed. */
> > +	if (mapping == last)
> > +		return true;
> > +	/*
> > +	 * Else we have to recheck with the folio lock held, for mapping
> > +	 * stability. TODO: killable?
> 
> I wouldn't've thought that'd be necessary.  The folio can't be being
> read because it's mapped, and we won't map a folio until it's uptodate.

Makes sense, I'll avoid the trouble then.

> 
> > +	 */
> > +	folio_lock(folio);
> > +	mapping = folio_mapping(folio);
> 
> I think you're safe to just look at folio->mapping here.  You have a
> refcount on the folio so it can't be freed, and I'm not sure there's a
> way to transition from page cache folio to anon folio without taking a
> trip through the page allocator.

Yep, makes sense. I don't think there is either. The worst that can happen is
that the folio could be truncated out while we have a reference but not the lock.
I think I just used the helper for the sake of using the helper, so I'll replace
it with ->mapping.

> 
> > +	/* May have been truncated, etc */
> > +	if (!mapping)
> > +		goto out_lock;
> 
> typically we call this "out_unlock".

ACK

> 
> > +	inode = mapping->host;
> > +	may = inode_owner_or_capable(&nop_mnt_idmap, inode) ||
> > +	      inode_permission(&nop_mnt_idmap, inode, MAY_WRITE) == 0;
> > +	if (likely(may))
> > +		*plast = mapping;
> > +out_lock:
> > +	folio_unlock(folio);
> > +	return may;
> > +}
> 
> I don't have a problem with the idea, other than it's really sad we have
> to do this.

Indeed :/

-- 
Pedro

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Matthew Wilcox 1 week, 1 day ago

On Sun, May 17, 2026 at 01:59:41AM +0100, Pedro Falcato wrote:
> On Sun, May 17, 2026 at 12:07:05AM +0100, Matthew Wilcox wrote:
> > On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> > > +static bool may_write_to_page(struct page *page, struct address_space **plast)
> > > +{
> > > +	struct folio *folio = page_folio(page);
> > > +	struct address_space *mapping, *last = *plast;
> > > +	struct inode *inode;
> > > +	bool may = false;
> > > +
> > > +	if (!READ_ONCE(sysctl_splice_needs_write))
> > > +		return true;
> > > +	/*
> > > +	 * Always fine to write to anon folios.
> > > +	 */
> > > +	if (folio_test_anon(folio))
> > > +		return true;
> > 
> > What about KSM?  It's not something we've seen attacked yet, but it'd be
> > pretty nasty to be able to change a KSM page in another process.
> 
> It's my understanding that only anon pages can be KSM'd, and KSM still keeps
> the FOLIO_MAPPING_ANON bit set. So folio_test_anon() should still test true
> for those.

I think you misunderstood what I meant.  If we have a buggy user which
can write to read-only file pages, then it should also be prevented from
writing to KSM pages.

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Pedro Falcato 1 week ago

On Sun, May 17, 2026 at 02:17:18AM +0100, Matthew Wilcox wrote:
> On Sun, May 17, 2026 at 01:59:41AM +0100, Pedro Falcato wrote:
> > On Sun, May 17, 2026 at 12:07:05AM +0100, Matthew Wilcox wrote:
> > > On Sat, May 16, 2026 at 07:21:26PM +0100, Pedro Falcato wrote:
> > > > +static bool may_write_to_page(struct page *page, struct address_space **plast)
> > > > +{
> > > > +	struct folio *folio = page_folio(page);
> > > > +	struct address_space *mapping, *last = *plast;
> > > > +	struct inode *inode;
> > > > +	bool may = false;
> > > > +
> > > > +	if (!READ_ONCE(sysctl_splice_needs_write))
> > > > +		return true;
> > > > +	/*
> > > > +	 * Always fine to write to anon folios.
> > > > +	 */
> > > > +	if (folio_test_anon(folio))
> > > > +		return true;
> > > 
> > > What about KSM?  It's not something we've seen attacked yet, but it'd be
> > > pretty nasty to be able to change a KSM page in another process.
> > 
> > It's my understanding that only anon pages can be KSM'd, and KSM still keeps
> > the FOLIO_MAPPING_ANON bit set. So folio_test_anon() should still test true
> > for those.
> 
> I think you misunderstood what I meant.  If we have a buggy user which
> can write to read-only file pages, then it should also be prevented from
> writing to KSM pages.

Hmm, I see. Are you suggesting we unshare KSM pages here? Or just straight
up reject them?

Rejecting would be relatively sane if only we had access to the VMA here
(in normal GUP), testing on folio_test_ksm() is less robust :/

-- 
Pedro

Re: [RFC PATCH] fs/splice: allow for a way to block splice() with read-only files

Posted by Matthew Wilcox 1 week ago

On Sun, May 17, 2026 at 10:01:30AM +0100, Pedro Falcato wrote:
> On Sun, May 17, 2026 at 02:17:18AM +0100, Matthew Wilcox wrote:
> > If we have a buggy user which
> > can write to read-only file pages, then it should also be prevented from
> > writing to KSM pages.
> 
> Hmm, I see. Are you suggesting we unshare KSM pages here? Or just straight
> up reject them?
> 
> Rejecting would be relatively sane if only we had access to the VMA here
> (in normal GUP), testing on folio_test_ksm() is less robust :/

I think we have to unshare?  As I understand KSM, it's done to a task,
so it wouldn't be aware that it's done something potentially dangerous
(unlike mapping a read-only file then splicing from it).  Also, it'll be
non-deterministic whether any given splice might fail.

Bleh.  Maybe just declare KSM to be vulnerable.