Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

[RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Mickaël Salaün 5 months, 2 weeks ago

Hi,

Script interpreters can check if a file would be allowed to be executed
by the kernel using the new AT_EXECVE_CHECK flag. This approach works
well on systems with write-xor-execute policies, where scripts cannot
be modified by malicious processes. However, this protection may not be
available on more generic distributions.

The key difference between `./script.sh` and `sh script.sh` (when using
AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
for writing while it's being executed. To achieve parity, the kernel
should provide a mechanism for script interpreters to deny write access
during script interpretation. While interpreters can copy script content
into a buffer, a race condition remains possible after AT_EXECVE_CHECK.

This patch series introduces a new O_DENY_WRITE flag for use with
open*(2) and fcntl(2). Both interfaces are necessary since script
interpreters may receive either a file path or file descriptor. For
backward compatibility, open(2) with O_DENY_WRITE will not fail on
unsupported systems, while users requiring explicit support guarantees
can use openat2(2).

The check_exec.rst documentation and related examples do not mention this new
feature yet.

Regards,

Mickaël Salaün (2):
  fs: Add O_DENY_WRITE
  selftests/exec: Add O_DENY_WRITE tests

 fs/fcntl.c                                |  26 ++-
 fs/file_table.c                           |   2 +
 fs/namei.c                                |   6 +
 include/linux/fcntl.h                     |   2 +-
 include/uapi/asm-generic/fcntl.h          |   4 +
 tools/testing/selftests/exec/check-exec.c | 219 ++++++++++++++++++++++
 6 files changed, 256 insertions(+), 3 deletions(-)


base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
-- 
2.50.1

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Christian Brauner 5 months, 2 weeks ago

On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote:
> Hi,
> 
> Script interpreters can check if a file would be allowed to be executed
> by the kernel using the new AT_EXECVE_CHECK flag. This approach works
> well on systems with write-xor-execute policies, where scripts cannot
> be modified by malicious processes. However, this protection may not be
> available on more generic distributions.
> 
> The key difference between `./script.sh` and `sh script.sh` (when using
> AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
> for writing while it's being executed. To achieve parity, the kernel
> should provide a mechanism for script interpreters to deny write access
> during script interpretation. While interpreters can copy script content
> into a buffer, a race condition remains possible after AT_EXECVE_CHECK.
> 
> This patch series introduces a new O_DENY_WRITE flag for use with
> open*(2) and fcntl(2). Both interfaces are necessary since script
> interpreters may receive either a file path or file descriptor. For
> backward compatibility, open(2) with O_DENY_WRITE will not fail on
> unsupported systems, while users requiring explicit support guarantees
> can use openat2(2).

We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff
before and you've been told by Linus as well that this is a nogo.

Nothing has changed in that regard and I'm not interested in stuffing
the VFS APIs full of special-purpose behavior to work around the fact
that this is work that needs to be done in userspace. Change the apps,
stop pushing more and more cruft into the VFS that has no business
there.

That's before we get into all the issues that are introduced by this
mechanism that magically makes arbitrary files unwritable. It's not just
a DoS; it's likely to cause breakage in userspace as well. I removed the
deny-write from execve because it already breaks various use-cases or
leads to spurious failures in e.g., go. We're not spreading this disease
as a first-class VFS API.

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Mickaël Salaün 5 months, 2 weeks ago

On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote:
> > Hi,
> > 
> > Script interpreters can check if a file would be allowed to be executed
> > by the kernel using the new AT_EXECVE_CHECK flag. This approach works
> > well on systems with write-xor-execute policies, where scripts cannot
> > be modified by malicious processes. However, this protection may not be
> > available on more generic distributions.
> > 
> > The key difference between `./script.sh` and `sh script.sh` (when using
> > AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
> > for writing while it's being executed. To achieve parity, the kernel
> > should provide a mechanism for script interpreters to deny write access
> > during script interpretation. While interpreters can copy script content
> > into a buffer, a race condition remains possible after AT_EXECVE_CHECK.
> > 
> > This patch series introduces a new O_DENY_WRITE flag for use with
> > open*(2) and fcntl(2). Both interfaces are necessary since script
> > interpreters may receive either a file path or file descriptor. For
> > backward compatibility, open(2) with O_DENY_WRITE will not fail on
> > unsupported systems, while users requiring explicit support guarantees
> > can use openat2(2).
> 
> We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff
> before and you've been told by Linus as well that this is a nogo.

Oh, please, don't mix up everything.  First, this is an RFC, and as I
explained, the goal is to start a discussion with something concrete.
Second, doing a one-time check on a file and providing guarantees for
the whole lifetime of an opened file requires different approaches,
hence this O_ *proposal*.

> 
> Nothing has changed in that regard and I'm not interested in stuffing
> the VFS APIs full of special-purpose behavior to work around the fact
> that this is work that needs to be done in userspace. Change the apps,
> stop pushing more and more cruft into the VFS that has no business
> there.

It would be interesting to know how to patch user space to get the same
guarantees...  Do you think I would propose a kernel patch otherwise?

> 
> That's before we get into all the issues that are introduced by this
> mechanism that magically makes arbitrary files unwritable. It's not just
> a DoS it's likely to cause breakage in userspace as well. I removed the
> deny-write from execve because it already breaks various use-cases or
> leads to spurious failures in e.g., go. We're not spreading this disease
> as a first-class VFS API.

Jann explained it very well, and the deny-write for execve is still
there, but let's keep it civil.  I already agreed that this is not a
good approach, but we could get interesting proposals.

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Aleksa Sarai 5 months, 2 weeks ago

On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > Nothing has changed in that regard and I'm not interested in stuffing
> > the VFS APIs full of special-purpose behavior to work around the fact
> > that this is work that needs to be done in userspace. Change the apps,
> > stop pushing more and more cruft into the VFS that has no business
> > there.
> 
> It would be interesting to know how to patch user space to get the same
> guarantees...  Do you think I would propose a kernel patch otherwise?

You could mmap the script file with MAP_PRIVATE. This is the *actual*
protection the kernel uses against overwriting binaries (yes, ETXTBSY is
nice but IIRC there are ways to get around it anyway). Of course, most
interpreters don't mmap their scripts, but this is a potential solution.
If the security policy is based on validating the script text in some
way, this avoids the TOCTOU.

Now, in cases where you have IMA or something and you only permit signed
binaries to execute, you could argue there is a different race here (an
attacker creates a malicious script, runs it, and then replaces it with
a valid script's contents and metadata after the fact to get
AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
this is even possible with IMA (can an unprivileged user even set
security.ima?). But even then, I would expect users that really need
this would also probably use fs-verity or dm-verity that would block
this kind of attack since it would render the files read-only anyway.

This is why a more detailed threat model of what kinds of attacks are
relevant is useful. I was there for the talk you gave and subsequent
discussion at last year's LPC, but I felt that your threat model was
not really fleshed out at all. I am still not sure what capabilities you
expect the attacker to have nor what is being used to authenticate
binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above
assumptions, but I can't know without knowing what threat model you have
in mind, *in detail*.

For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
there are plenty of ways for an attacker to execute their own code
without using interpreters (create a new tmpfs with fsopen(2) for
instance). Executable memfds are even easier and don't require
privileges on most systems (yes, you can block them with vm.memfd_noexec
but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or
mount(2)).

(As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
top-level AT_* bits for a per-syscall flag -- the block comment I added
in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
allocated") was meant to avoid this happening but it seems you and the
reviewers missed that...)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Roberto Sassu 5 months, 1 week ago

On Thu, 2025-08-28 at 10:14 +1000, Aleksa Sarai wrote:
> On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > Nothing has changed in that regard and I'm not interested in stuffing
> > > the VFS APIs full of special-purpose behavior to work around the fact
> > > that this is work that needs to be done in userspace. Change the apps,
> > > stop pushing more and more cruft into the VFS that has no business
> > > there.
> > 
> > It would be interesting to know how to patch user space to get the same
> > guarantees...  Do you think I would propose a kernel patch otherwise?
> 
> You could mmap the script file with MAP_PRIVATE. This is the *actual*
> protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> nice but IIRC there are ways to get around it anyway). Of course, most
> interpreters don't mmap their scripts, but this is a potential solution.
> If the security policy is based on validating the script text in some
> way, this avoids the TOCTOU.
> 
> Now, in cases where you have IMA or something and you only permit signed
> binaries to execute, you could argue there is a different race here (an
> attacker creates a malicious script, runs it, and then replaces it with
> a valid script's contents and metadata after the fact to get
> AT_EXECVE_CHECK to permit the execution). However, I'm not sure that

Uhm, let's consider measurement, which I'm more familiar with.

I think the race you wanted to express was that the attacker replaces
the good script, verified with AT_EXECVE_CHECK, with the bad script
after the IMA verification but before the interpreter reads it.

Fortunately, IMA is able to cope with this situation, since this race
can happen for any file open, where of course the file may not be
read-locked.

If the attacker tries to concurrently open the script for write in this
race window, IMA will report this event (called violation) in the
measurement list, and during remote attestation it will be clear that
the interpreter did not read what was measured.

We just need to run the violation check for the BPRM_CHECK hook too
(then, probably for us the O_DENY_WRITE flag or alternative solution
would not be needed, for measurement).

Please, let us know when you apply patches like 2a010c412853 ("fs:
don't block i_writecount during exec"). We had a discussion [1], but
probably I missed when it was decided to be applied (I saw now it was
in the same thread, but didn't get that at the time). We would have
needed to update our code accordingly. In the future, we will try to
better clarify our expectations from the VFS.

Thanks

Roberto

[1]: https://lore.kernel.org/linux-fsdevel/88d5a92379755413e1ec3c981d9a04e6796da110.camel@huaweicloud.com/#t

> this is even possible with IMA (can an unprivileged user even set
> security.ima?). But even then, I would expect users that really need
> this would also probably use fs-verity or dm-verity that would block
> this kind of attack since it would render the files read-only anyway.
> 
> This is why a more detailed threat model of what kinds of attacks are
> relevant is useful. I was there for the talk you gave and subsequent
> discussion at last year's LPC, but I felt that your threat model was
> not really fleshed out at all. I am still not sure what capabilities you
> expect the attacker to have nor what is being used to authenticate
> binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above
> assumptions, but I can't know without knowing what threat model you have
> in mind, *in detail*.
> 
> For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
> there are plenty of ways for an attacker to execute their own code
> without using interpreters (create a new tmpfs with fsopen(2) for
> instance). Executable memfds are even easier and don't require
> privileges on most systems (yes, you can block them with vm.memfd_noexec
> but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or
> mount(2)).
> 
> (As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
> top-level AT_* bits for a per-syscall flag -- the block comment I added
> in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
> allocated") was meant to avoid this happening but it seems you and the
> reviewers missed that...)
>

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Andy Lutomirski 5 months, 1 week ago

Can you clarify this a bit for those of us who are not well-versed in
exactly what "measurement" does?

On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
<roberto.sassu@huaweicloud.com> wrote:
> > Now, in cases where you have IMA or something and you only permit signed
> > binaries to execute, you could argue there is a different race here (an
> > attacker creates a malicious script, runs it, and then replaces it with
> > a valid script's contents and metadata after the fact to get
> > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
>
> Uhm, let's consider measurement, I'm more familiar with.
>
> I think the race you wanted to express was that the attacker replaces
> the good script, verified with AT_EXECVE_CHECK, with the bad script
> after the IMA verification but before the interpreter reads it.
>
> Fortunately, IMA is able to cope with this situation, since this race
> can happen for any file open, where of course a file can be not read-
> locked.

I assume you mean that this has nothing specifically to do with
scripts, as IMA tries to protect ordinary (non-"execute" file access)
as well.  Am I right?

>
> If the attacker tries to concurrently open the script for write in this
> race window, IMA will report this event (called violation) in the
> measurement list, and during remote attestation it will be clear that
> the interpreter did not read what was measured.
>
> We just need to run the violation check for the BPRM_CHECK hook too
> (then, probably for us the O_DENY_WRITE flag or alternative solution
> would not be needed, for measurement).

This seems consistent with my interpretation above, but ...

>
> Please, let us know when you apply patches like 2a010c412853 ("fs:
> don't block i_writecount during exec"). We had a discussion [1], but
> probably I missed when it was decided to be applied (I saw now it was
> in the same thread, but didn't get that at the time). We would have
> needed to update our code accordingly. In the future, we will try to
> clarify better our expectations from the VFS.

... I didn't follow this.

Suppose there's some valid contents of /bin/sleep.  I execute
/bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
for write, not by replacing it), and the kernel in question doesn't do
ETXTBSY.  Then the sleep process reads (and executes) the modified
contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
needed?

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Roberto Sassu 5 months, 1 week ago

On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote:
> Can you clarify this a bit for those of us who are not well-versed in
> exactly what "measurement" does?
> 
> On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
> <roberto.sassu@huaweicloud.com> wrote:
> > > Now, in cases where you have IMA or something and you only permit signed
> > > binaries to execute, you could argue there is a different race here (an
> > > attacker creates a malicious script, runs it, and then replaces it with
> > > a valid script's contents and metadata after the fact to get
> > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
> > 
> > Uhm, let's consider measurement, I'm more familiar with.
> > 
> > I think the race you wanted to express was that the attacker replaces
> > the good script, verified with AT_EXECVE_CHECK, with the bad script
> > after the IMA verification but before the interpreter reads it.
> > 
> > Fortunately, IMA is able to cope with this situation, since this race
> > can happen for any file open, where of course a file can be not read-
> > locked.
> 
> I assume you mean that this has nothing specifically to do with
> scripts, as IMA tries to protect ordinary (non-"execute" file access)
> as well.  Am I right?

Yes, correct, violations are checked for all open() and mmap()
involving regular files. It would not be special to do it for scripts.

> > If the attacker tries to concurrently open the script for write in this
> > race window, IMA will report this event (called violation) in the
> > measurement list, and during remote attestation it will be clear that
> > the interpreter did not read what was measured.
> > 
> > We just need to run the violation check for the BPRM_CHECK hook too
> > (then, probably for us the O_DENY_WRITE flag or alternative solution
> > would not be needed, for measurement).
> 
> This seems consistent with my interpretation above, but ...

The comment here [1] seems to be clear on why the violation check is
not done for execution (BPRM_CHECK hook). Since the OS read-locks the
files during execution, this implicitly guarantees that there will not
be concurrent writes, and thus no IMA violations.

However, recently, we took advantage of AT_EXECVE_CHECK to also
evaluate the integrity of scripts (when not executed via ./). Since we
are using the same hook for both executed files (read-locked) and
scripts (I guess non-read-locked), then we need to do a violation check
for BPRM_CHECK too, although it will be redundant for the first
category.

> > Please, let us know when you apply patches like 2a010c412853 ("fs:
> > don't block i_writecount during exec"). We had a discussion [1], but
> > probably I missed when it was decided to be applied (I saw now it was
> > in the same thread, but didn't get that at the time). We would have
> > needed to update our code accordingly. In the future, we will try to
> > clarify better our expectations from the VFS.
> 
> ... I didn't follow this.
> 
> Suppose there's some valid contents of /bin/sleep.  I execute
> /bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
> for write, not by replacing it), and the kernel in question doesn't do
> ETXTBSY.  Then the sleep process reads (and executes) the modified
> contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
> needed?

Ok, this is actually a good opportunity to explain what will be
missing. If you do the operations in the order you proposed, actually a
violation will be emitted, because the violating operation is an open()
and the check is done for this system call.

However, if you do the opposite, first open for write and then
execution, IMA will not be aware of that since it trusts the OS to not
make it happen and will not check for violations.

So yes, in your case the remote attestation will fail (actually it is
up to the remote verifier to decide...). But in the opposite case, the
writer could wait for IMA to measure the genuine content and then
modify the content conveniently. The remote attestation will succeed.

Adding the violation check on BPRM_CHECK should be sufficient to avoid
such a situation, but I would try to think about whether there are other
implications for IMA of not read-locking the files on execution.

Roberto

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Roberto Sassu 5 months, 1 week ago

On Mon, 2025-09-01 at 19:01 +0200, Roberto Sassu wrote:
> On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote:
> > Can you clarify this a bit for those of us who are not well-versed in
> > exactly what "measurement" does?

Ah, sorry, I missed that.

Measurement refers to the process of collecting the file digest and
storing it in the measurement list, as opposed to appraisal which
instead compares the collected file digest with a reference value
(assumed to be good), and denies access in case of a mismatch.

Integrity violations are detected and reported only for measurement.
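
As a rough illustration of the distinction above, here is a toy model in
Python. It is not how IMA is implemented: the function names, the in-memory
list, and the choice of SHA-256 are assumptions made only for this sketch.

```python
import hashlib


def measure(measurement_list, name, data):
    """Measurement: record the digest in a list; never denies access."""
    digest = hashlib.sha256(data).hexdigest()
    measurement_list.append((name, digest))
    return digest


def appraise(data, reference_digest):
    """Appraisal: compare against a known-good digest; False means deny."""
    return hashlib.sha256(data).hexdigest() == reference_digest


if __name__ == "__main__":
    log = []
    good = measure(log, "script.sh", b"echo hi\n")
    print(log)
    print(appraise(b"echo hi\n", good))
    print(appraise(b"echo pwned\n", good))
```

Measurement leaves the decision to whoever later inspects the list (for
example a remote verifier), while appraisal makes the allow/deny decision
locally at access time.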

Roberto

> > On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
> > <roberto.sassu@huaweicloud.com> wrote:
> > > > Now, in cases where you have IMA or something and you only permit signed
> > > > binaries to execute, you could argue there is a different race here (an
> > > > attacker creates a malicious script, runs it, and then replaces it with
> > > > a valid script's contents and metadata after the fact to get
> > > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
> > > 
> > > Uhm, let's consider measurement, I'm more familiar with.
> > > 
> > > I think the race you wanted to express was that the attacker replaces
> > > the good script, verified with AT_EXECVE_CHECK, with the bad script
> > > after the IMA verification but before the interpreter reads it.
> > > 
> > > Fortunately, IMA is able to cope with this situation, since this race
> > > can happen for any file open, where of course a file can be not read-
> > > locked.
> > 
> > I assume you mean that this has nothing specifically to do with
> > scripts, as IMA tries to protect ordinary (non-"execute" file access)
> > as well.  Am I right?
> 
> Yes, correct, violations are checked for all open() and mmap()
> involving regular files. It would not be special to do it for scripts.
> 
> > > If the attacker tries to concurrently open the script for write in this
> > > race window, IMA will report this event (called violation) in the
> > > measurement list, and during remote attestation it will be clear that
> > > the interpreter did not read what was measured.
> > > 
> > > We just need to run the violation check for the BPRM_CHECK hook too
> > > (then, probably for us the O_DENY_WRITE flag or alternative solution
> > > would not be needed, for measurement).
> > 
> > This seems consistent with my interpretation above, but ...
> 
> The comment here [1] seems to be clear on why the violation check it is
> not done for execution (BPRM_CHECK hook). Since the OS read-locks the
> files during execution, this implicitly guarantees that there will not
> be concurrent writes, and thus no IMA violations.
> 
> However, recently, we took advantage of AT_EXECVE_CHECK to also
> evaluate the integrity of scripts (when not executed via ./). Since we
> are using the same hook for both executed files (read-locked) and
> scripts (I guess non-read-locked), then we need to do a violation check
> for BPRM_CHECK too, although it will be redundant for the first
> category.
> 
> > > Please, let us know when you apply patches like 2a010c412853 ("fs:
> > > don't block i_writecount during exec"). We had a discussion [1], but
> > > probably I missed when it was decided to be applied (I saw now it was
> > > in the same thread, but didn't get that at the time). We would have
> > > needed to update our code accordingly. In the future, we will try to
> > > clarify better our expectations from the VFS.
> > 
> > ... I didn't follow this.
> > 
> > Suppose there's some valid contents of /bin/sleep.  I execute
> > /bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
> > for write, not by replacing it), and the kernel in question doesn't do
> > ETXTBSY.  Then the sleep process reads (and executes) the modified
> > contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
> > needed?
> 
> Ok, this is actually a good opportunity to explain what it will be
> missing. If you do the operations in the order you proposed, actually a
> violation will be emitted, because the violating operation is an open()
> and the check is done for this system call.
> 
> However, if you do the opposite, first open for write and then
> execution, IMA will not be aware of that since it trusts the OS to not
> make it happen and will not check for violations.
> 
> So yes, in your case the remote attestation will fail (actually it is
> up to the remote verifier to decide...). But in the opposite case, the
> writer could wait for IMA to measure the genuine content and then
> modify the content conveniently. The remote attestation will succeed.
> 
> Adding the violation check on BPRM_CHECK should be sufficient to avoid
> such situation, but I would try to think if there are other
> implications for IMA of not read-locking the files on execution.
> 
> Roberto
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565
>

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Andy Lutomirski 5 months, 2 weeks ago

On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > Nothing has changed in that regard and I'm not interested in stuffing
> > > the VFS APIs full of special-purpose behavior to work around the fact
> > > that this is work that needs to be done in userspace. Change the apps,
> > > stop pushing more and more cruft into the VFS that has no business
> > > there.
> >
> > It would be interesting to know how to patch user space to get the same
> > guarantees...  Do you think I would propose a kernel patch otherwise?
>
> You could mmap the script file with MAP_PRIVATE. This is the *actual*
> protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> nice but IIRC there are ways to get around it anyway).

Wait, really?  MAP_PRIVATE prevents writes to the mapping from
affecting the file, but I don't think that writes to the file will
break the MAP_PRIVATE CoW if it's not already broken.

IPython says:

In [1]: import mmap, tempfile

In [2]: f = tempfile.TemporaryFile()

In [3]: f.write(b'initial contents')
Out[3]: 16

In [4]: f.flush()

In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
prot=mmap.PROT_READ)

In [6]: map[:]
Out[6]: b'initial contents'

In [7]: f.seek(0)
Out[7]: 0

In [8]: f.write(b'changed')
Out[8]: 7

In [9]: f.flush()

In [10]: map[:]
Out[10]: b'changed contents'
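
The same experiment can be written as a standalone script that uses
unbuffered os-level calls, so that no library buffering is involved. This is
a minimal sketch: the helper name is made up for illustration, and the result
after the write depends on the platform (POSIX leaves it unspecified whether
a MAP_PRIVATE mapping observes later changes to the file).

```python
import mmap
import os
import tempfile


def private_mapping_after_file_write():
    """Return (before, after) views of a MAP_PRIVATE mapping around a write."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"initial contents")
        # Read-only private mapping of the 16 bytes just written.
        mapping = mmap.mmap(fd, 16, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
        try:
            before = mapping[:]
            # Unbuffered write straight to the file, bypassing Python's I/O layer.
            os.pwrite(fd, b"changed", 0)
            after = mapping[:]
        finally:
            mapping.close()
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(private_mapping_after_file_write())
```

On Linux, the second value is expected to match the session above, since
pages of the mapping that have not been copied still refer to the page cache.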

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Serge E. Hallyn 5 months, 1 week ago

On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > that this is work that needs to be done in userspace. Change the apps,
> > > > stop pushing more and more cruft into the VFS that has no business
> > > > there.
> > >
> > > It would be interesting to know how to patch user space to get the same
> > > guarantees...  Do you think I would propose a kernel patch otherwise?
> >
> > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > nice but IIRC there are ways to get around it anyway).
> 
> Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> affecting the file, but I don't think that writes to the file will
> break the MAP_PRIVATE CoW if it's not already broken.
> 
> IPython says:
> 
> In [1]: import mmap, tempfile
> 
> In [2]: f = tempfile.TemporaryFile()
> 
> In [3]: f.write(b'initial contents')
> Out[3]: 16
> 
> In [4]: f.flush()
> 
> In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> prot=mmap.PROT_READ)
> 
> In [6]: map[:]
> Out[6]: b'initial contents'
> 
> In [7]: f.seek(0)
> Out[7]: 0
> 
> In [8]: f.write(b'changed')
> Out[8]: 7
> 
> In [9]: f.flush()
> 
> In [10]: map[:]
> Out[10]: b'changed contents'

That was surprising to me, however, if I split the reader
and writer into different processes, so

P1:
f = open("/tmp/3", "w")
f.write('initial contents')
f.flush()

P2:
import mmap
f = open("/tmp/3", "r")
map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

Back to P1:
f.seek(0)
f.write('changed')

Back to P2:
map[:]

Then P2 gives me:

b'initial contents'

-serge

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Jann Horn 5 months, 1 week ago

On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > there.
> > > >
> > > > It would be interesting to know how to patch user space to get the same
> > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > >
> > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > nice but IIRC there are ways to get around it anyway).
> >
> > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > affecting the file, but I don't think that writes to the file will
> > break the MAP_PRIVATE CoW if it's not already broken.
> >
> > IPython says:
> >
> > In [1]: import mmap, tempfile
> >
> > In [2]: f = tempfile.TemporaryFile()
> >
> > In [3]: f.write(b'initial contents')
> > Out[3]: 16
> >
> > In [4]: f.flush()
> >
> > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > prot=mmap.PROT_READ)
> >
> > In [6]: map[:]
> > Out[6]: b'initial contents'
> >
> > In [7]: f.seek(0)
> > Out[7]: 0
> >
> > In [8]: f.write(b'changed')
> > Out[8]: 7
> >
> > In [9]: f.flush()
> >
> > In [10]: map[:]
> > Out[10]: b'changed contents'
>
> That was surprising to me, however, if I split the reader
> and writer into different processes, so

Testing this in python is a terrible idea because it obfuscates the
actual syscalls from you.

> P1:
> f = open("/tmp/3", "w")
> f.write('initial contents')
> f.flush()
>
> P2:
> import mmap
> f = open("/tmp/3", "r")
> map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
>
> Back to P1:
> f.seek(0)
> f.write('changed')
>
> Back to P2:
> map[:]
>
> Then P2 gives me:
>
> b'initial contents'

Because when you executed `f.write('changed')`, Python internally
buffered the write. "changed" is never actually written into the file
in your example. If you add a `f.flush()` in P1 after this, running
`map[:]` in P2 again will show you the new data.
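
The buffering effect can be reproduced in a single process. This is a minimal sketch, assuming a Unix-like system and CPython's default text-mode buffering; it uses a second descriptor with os.pread() in place of P2, and the name buffered_write_demo is hypothetical.

```python
import os
import tempfile


def buffered_write_demo():
    """Return the file contents seen by a second descriptor before and after flush()."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w") as f:
            f.write("initial contents")
            f.flush()                          # the data reaches the kernel here
            reader = os.open(path, os.O_RDONLY)
            try:
                f.seek(0)
                f.write("changed")             # held in Python's userspace buffer
                before = os.pread(reader, 64, 0)
                f.flush()                      # now write(2) is actually issued
                after = os.pread(reader, 64, 0)
            finally:
                os.close(reader)
    finally:
        os.unlink(path)
    return before, after
```

The first read still sees the old bytes because no write(2) has been issued yet; only the second read, after flush(), sees the overwrite.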

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Andy Lutomirski 5 months, 1 week ago

On Mon, Sep 1, 2025 at 4:06 AM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > > there.
> > > > >
> > > > > It would be interesting to know how to patch user space to get the same
> > > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > > >
> > > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > > nice but IIRC there are ways to get around it anyway).
> > >
> > > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > > affecting the file, but I don't think that writes to the file will
> > > break the MAP_PRIVATE CoW if it's not already broken.
> > >
> > > IPython says:
> > >
> > > In [1]: import mmap, tempfile
> > >
> > > In [2]: f = tempfile.TemporaryFile()
> > >
> > > In [3]: f.write(b'initial contents')
> > > Out[3]: 16
> > >
> > > In [4]: f.flush()
> > >
> > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > > prot=mmap.PROT_READ)
> > >
> > > In [6]: map[:]
> > > Out[6]: b'initial contents'
> > >
> > > In [7]: f.seek(0)
> > > Out[7]: 0
> > >
> > > In [8]: f.write(b'changed')
> > > Out[8]: 7
> > >
> > > In [9]: f.flush()
> > >
> > > In [10]: map[:]
> > > Out[10]: b'changed contents'
> >
> > That was surprising to me, however, if I split the reader
> > and writer into different processes, so
>
> Testing this in python is a terrible idea because it obfuscates the
> actual syscalls from you.
>
> > P1:
> > f = open("/tmp/3", "w")
> > f.write('initial contents')
> > f.flush()
> >
> > P2:
> > import mmap
> > f = open("/tmp/3", "r")
> > map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
> >
> > Back to P1:
> > f.seek(0)
> > f.write('changed')
> >
> > Back to P2:
> > map[:]
> >
> > Then P2 gives me:
> >
> > b'initial contents'
>
> Because when you executed `f.write('changed')`, Python internally
> buffered the write. "changed" is never actually written into the file
> in your example. If you add a `f.flush()` in P1 after this, running
> `map[:]` in P2 again will show you the new data.
>

These days, one can type in Python, ask an LLM to translate to C, and
get almost-correct output :)  Or one can use os.write(), which is
exactly what I should have done.

--Andy
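
A sketch of the os.write() variant, assuming Linux and a regular temporary file; the function name is hypothetical. It bypasses Python's buffering entirely, so the only open question is what the MAP_PRIVATE mapping shows afterwards. mmap(2) leaves that unspecified; on a page-cache-backed file system the mapping typically shows the new bytes, as in the IPython session quoted above.

```python
import mmap
import os
import tempfile


def private_mapping_after_write():
    """Map a file MAP_PRIVATE, overwrite the file, and return both views of the mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"initial contents")      # unbuffered write(2)
        with mmap.mmap(fd, 16, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ) as m:
            first = m[:]
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, b"changed")           # overwrite through the file
            second = m[:]
    finally:
        os.close(fd)
        os.unlink(path)
    return first, second
```

Since the mapping is never written to, no copy-on-write has happened by the time the file is overwritten, which is why the private mapping is not a snapshot.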

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Serge E. Hallyn 5 months, 1 week ago

On Mon, Sep 01, 2025 at 01:05:16PM +0200, Jann Horn wrote:
> On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > > there.
> > > > >
> > > > > It would be interesting to know how to patch user space to get the same
> > > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > > >
> > > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > > nice but IIRC there are ways to get around it anyway).
> > >
> > > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > > affecting the file, but I don't think that writes to the file will
> > > break the MAP_PRIVATE CoW if it's not already broken.
> > >
> > > IPython says:
> > >
> > > In [1]: import mmap, tempfile
> > >
> > > In [2]: f = tempfile.TemporaryFile()
> > >
> > > In [3]: f.write(b'initial contents')
> > > Out[3]: 16
> > >
> > > In [4]: f.flush()
> > >
> > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > > prot=mmap.PROT_READ)
> > >
> > > In [6]: map[:]
> > > Out[6]: b'initial contents'
> > >
> > > In [7]: f.seek(0)
> > > Out[7]: 0
> > >
> > > In [8]: f.write(b'changed')
> > > Out[8]: 7
> > >
> > > In [9]: f.flush()
> > >
> > > In [10]: map[:]
> > > Out[10]: b'changed contents'
> >
> > That was surprising to me, however, if I split the reader
> > and writer into different processes, so
> 
> Testing this in python is a terrible idea because it obfuscates the
> actual syscalls from you.

Hah, I was just trying to fit in :), but of course you're right.
Redoing it in straight c, I'm getting the updates.

-serge

// mmap-w.c (creates and overwrites)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define FIRST "Initial contents"
#define SECOND "Updated contents"

int main() {
	int fd, rc;
	char c;

	fd = open("/tmp/m", O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}
	rc = write(fd, FIRST, sizeof(FIRST));
	if (rc < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = fsync(fd);
	if (rc < 0) {
		printf("flush failed: %m\n");
		_exit(1);
	}

	read(STDIN_FILENO, &c, 1);

	printf("updating the contents\n");

	rc = lseek(fd, 0, SEEK_SET);
	if (rc < 0) {
		printf("seek failed: %m\n");
		_exit(1);
	}

	rc = write(fd, SECOND, sizeof(SECOND));
	if (rc < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = close(fd);
	if (rc < 0) {
		printf("close failed: %m\n");
		_exit(1);
	}
	printf("done\n");
}

// mmap-r.c (checks and re-checks contents)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#define FIRST "Initial contents"
#define SECOND "Updated contents"

int main() {
	int fd, rc;
	char *m;
	char c;

	fd = open("/tmp/m", O_RDONLY);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}

	m = mmap(NULL, 40, PROT_READ, MAP_PRIVATE, fd, 0);
	if (m == MAP_FAILED) {
		printf("mmap failed: %m\n");
		_exit(1);
	}

	if (strncmp(m, FIRST, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n",
			m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}

	read(STDIN_FILENO, &c, 1);

	if (strncmp(m, SECOND, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n",
			m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}

	printf("done\n");
}

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Aleksa Sarai 5 months, 2 weeks ago

On 2025-08-27, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > that this is work that needs to be done in userspace. Change the apps,
> > > > stop pushing more and more cruft into the VFS that has no business
> > > > there.
> > >
> > > It would be interesting to know how to patch user space to get the same
> > > guarantees...  Do you think I would propose a kernel patch otherwise?
> >
> > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > nice but IIRC there are ways to get around it anyway).
> 
> Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> affecting the file, but I don't think that writes to the file will
> break the MAP_PRIVATE CoW if it's not already broken.

Oh I guess you're right -- that's news to me. And from mmap(2):

> MAP_PRIVATE
> [...] It is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.

But then what is the protection mechanism (in the absence of -ETXTBSY)
that stops you from overwriting the live text of a binary by just
writing to it?

I would need to go trawling through my old scripts to find the
reproducer that let you get around -ETXTBSY (I think it involved
executable memfds) but I distinctly remember that even if you overwrote
the binary you would not see the live process's mapped mm change value.
(Ditto for the few kernels when we removed -ETXTBSY.) I found this
surprising, but assumed that it was because of MAP_PRIVATE.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Theodore Ts'o 5 months, 2 weeks ago

Is there a single, unified design and requirements document that
describes the threat model, and what you are trying to achieve with
AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
that has landed for AT_EXECVE_CHECK and it really doesn't describe
what *are* the checks that AT_EXECVE_CHECK is trying to achieve:

   "The AT_EXECVE_CHECK execveat(2) flag, and the
   SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
   securebits are intended for script interpreters and dynamic linkers
   to enforce a consistent execution security policy handled by the
   kernel."

Um, what security policy?  What checks?  What is a sample exploit
which is blocked by AT_EXECVE_CHECK?

And then on top of it, why can't you do these checks by modifying the
script interpreters?

Confused,

						- Ted

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Mickaël Salaün 5 months, 2 weeks ago

On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> Is there a single, unified design and requirements document that
> describes the threat model, and what you are trying to achieve with
> AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> that has landed for AT_EXECVE_CHECK and it really doesn't describe
> what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> 
>    "The AT_EXECVE_CHECK execveat(2) flag, and the
>    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
>    securebits are intended for script interpreters and dynamic linkers
>    to enforce a consistent execution security policy handled by the
>    kernel."

From the documentation:

  Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
  on a regular file and returns 0 if execution of this file would be
  allowed, ignoring the file format and then the related interpreter
  dependencies (e.g. ELF libraries, script’s shebang).

> 
> Um, what security policy?

Whether the file is allowed to be executed.  This includes file
permission, mount point option, ACL, LSM policies...

> What checks?

Executability checks?

> What is a sample exploit
> which is blocked by AT_EXECVE_CHECK?

Executing/interpreting any data: sh script.txt

> 
> And then on top of it, why can't you do these checks by modifying the
> script interpreters?

The script interpreter requires modification to use AT_EXECVE_CHECK.

There is no other way for user space to reliably check executability of
files (taking into account all enforced security
policies/configurations).
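
For illustration, this is roughly what such a check could look like from an interpreter written in Python. It is a sketch under several assumptions: libc exports an execveat() wrapper, the kernel includes commit a5874fde3c08, and AT_EXECVE_CHECK has the value 0x10000 (please verify against your uapi headers). The helper names are hypothetical.

```python
import ctypes
import os

AT_EMPTY_PATH = 0x1000
AT_EXECVE_CHECK = 0x10000   # assumed value, check include/uapi/linux/fcntl.h


def execve_check(fd):
    """Ask the kernel whether executing fd would be allowed, without executing it.

    Returns 0 if allowed, an errno value if denied or unsupported,
    or None if libc has no execveat() wrapper.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    execveat = getattr(libc, "execveat", None)
    if execveat is None:
        return None
    execveat.argtypes = [ctypes.c_int, ctypes.c_char_p,
                         ctypes.POINTER(ctypes.c_char_p),
                         ctypes.POINTER(ctypes.c_char_p), ctypes.c_int]
    argv = (ctypes.c_char_p * 2)(b"check", None)
    envp = (ctypes.c_char_p * 1)(None)
    rc = execveat(fd, b"", argv, envp, AT_EMPTY_PATH | AT_EXECVE_CHECK)
    return 0 if rc == 0 else ctypes.get_errno()


def script_allowed(path):
    """Open path read-only and run the check on the resulting descriptor."""
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        return execve_check(fd)
    finally:
        os.close(fd)
```

On a kernel without the flag, the call is expected to fail with EINVAL rather than execute anything, so a caller has to decide how to treat that case.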

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Andy Lutomirski 5 months, 2 weeks ago

On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > Is there a single, unified design and requirements document that
> > describes the threat model, and what you are trying to achieve with
> > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> >
> >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> >    securebits are intended for script interpreters and dynamic linkers
> >    to enforce a consistent execution security policy handled by the
> >    kernel."
>
> From the documentation:
>
>   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
>   on a regular file and returns 0 if execution of this file would be
>   allowed, ignoring the file format and then the related interpreter
>   dependencies (e.g. ELF libraries, script’s shebang).
>
> >
> > Um, what security policy?
>
> Whether the file is allowed to be executed.  This includes file
> permission, mount point option, ACL, LSM policies...

This needs *waaaaay* more detail for any sort of useful evaluation.
Is an actual credible security policy rolling dice?  Asking ChatGPT?
Looking at security labels?  Does it care who can write to the file,
or who owns the file, or what the file's hash is, or what filesystem
it's on, or where it came from?  Does it dynamically inspect the
contents?  Is it controlled by an unprivileged process?

I can easily come up with security policies for which DENYWRITE is
completely useless.  I can come up with convoluted and
not-really-credible policies where DENYWRITE is important, but I'm
honestly not sure that those policies are actually useful.  I'm
honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
because it should have been parametrized by *what format is expected*
-- it might be possible to bypass a policy by executing a perfectly
fine Python script using bash, for example.

I genuinely have not come up with a security policy that I believe
makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
that such a policy does not exist -- I'm saying that I have not
thought of such a thing after a few minutes of thought and reading
these threads.

> > And then on top of it, why can't you do these checks by modifying the
> > script interpreters?
>
> The script interpreter requires modification to use AT_EXECVE_CHECK.
>
> There is no other way for user space to reliably check executability of
> files (taking into account all enforced security
> policies/configurations).
>

As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish
this goal.  If it were genuinely useful, I would much, much prefer a
totally different API: a *syscall* that takes, as input, a file
descriptor of something that an interpreter wants to execute and a
whole lot of context as to what that interpreter wants to do with it.
And I admit I'm *still* not convinced.

Seriously, consider all the unending recent attacks on LLMs an
inspiration.  The implications of viewing an image, downscaling the
image, possibly interpreting the image as something containing text,
possibly following instructions in a given language contained in the
image, etc are all wildly different.  A mechanism for asking for
general permission to "consume this image" is COMPLETELY MISSING THE
POINT.  (Never mind that the current crop of LLMs seem entirely
incapable of constraining their own use of some piece of input, but
that's a different issue and is beside the point here.)

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Mickaël Salaün 5 months, 2 weeks ago

On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > > Is there a single, unified design and requirements document that
> > > describes the threat model, and what you are trying to achieve with
> > > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> > >
> > >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> > >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> > >    securebits are intended for script interpreters and dynamic linkers
> > >    to enforce a consistent execution security policy handled by the
> > >    kernel."
> >
> > From the documentation:
> >
> >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> >   on a regular file and returns 0 if execution of this file would be
> >   allowed, ignoring the file format and then the related interpreter
> >   dependencies (e.g. ELF libraries, script’s shebang).
> >
> > >
> > > Um, what security policy?
> >
> > Whether the file is allowed to be executed.  This includes file
> > permission, mount point option, ACL, LSM policies...
> 
> This needs *waaaaay* more detail for any sort of useful evaluation.
> Is an actual credible security policy rolling dice?  Asking ChatGPT?
> Looking at security labels?  Does it care who can write to the file,
> or who owns the file, or what the file's hash is, or what filesystem
> it's on, or where it came from?  Does it dynamically inspect the
> contents?  Is it controlled by an unprivileged process?

AT_EXECVE_CHECK only does the same checks as done by other execveat(2)
calls, but without actually executing the file/fd.

> 
> I can easily come up with security policies for which DENYWRITE is
> completely useless.  I can come up with convoluted and
> not-really-credible policies where DENYWRITE is important, but I'm
> honestly not sure that those policies are actually useful.  I'm
> honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
> because it should have been parametrized by *what format is expected*
> -- it might be possible to bypass a policy by executing a perfectly
> fine Python script using bash, for example.

There has been a lot of bikeshedding for the AT_EXECVE_CHECK patch
series, and a lot of discussions too (you were part of them).  We ended
up with this design, which is simple and follows the kernel semantic
(requested by Linus).

> 
> I genuinely have not come up with a security policy that I believe
> makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
> that such a policy does not exist -- I'm saying that I have not
> thought of such a thing after a few minutes of thought and reading
> these threads.

A simple use case is for systems that want to enforce a
write-xor-execute policy e.g., thanks to mount point options.

> 
> 
> > > And then on top of it, why can't you do these checks by modifying the
> > > script interpreters?
> >
> > The script interpreter requires modification to use AT_EXECVE_CHECK.
> >
> > There is no other way for user space to reliably check executability of
> > files (taking into account all enforced security
> > policies/configurations).
> >
> 
> As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish
> this goal.  If it were genuinely useful, I would much, much prefer a
> totally different API: a *syscall* that takes, as input, a file
> descriptor of something that an interpreter wants to execute and a
> whole lot of context as to what that interpreter wants to do with it.
> And I admit I'm *still* not convinced.

As mentioned above, AT_EXECVE_CHECK follows the kernel semantic. Nothing
fancy.

> 
> Seriously, consider all the unending recent attacks on LLMs an
> inspiration.  The implications of viewing an image, downscaling the
> image, possibly interpreting the image as something containing text,
> possibly following instructions in a given language contained in the
> image, etc are all wildly different.  A mechanism for asking for
> general permission to "consume this image" is COMPLETELY MISSING THE
> POINT.  (Never mind that the current crop of LLMs seem entirely
> incapable of constraining their own use of some piece of input, but
> that's a different issue and is beside the point here.)

You're asking about what should we consider executable.  This is a good
question, but AT_EXECVE_CHECK is there to answer another question: would
the kernel execute it or not?

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Andy Lutomirski 5 months, 2 weeks ago

On Wed, Aug 27, 2025 at 12:07 PM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote:
> > On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > > > Is there a single, unified design and requirements document that
> > > > describes the threat model, and what you are trying to achieve with
> > > > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > > > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> > > >
> > > >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> > > >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> > > >    securebits are intended for script interpreters and dynamic linkers
> > > >    to enforce a consistent execution security policy handled by the
> > > >    kernel."
> > >
> > > From the documentation:
> > >
> > >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> > >   on a regular file and returns 0 if execution of this file would be
> > >   allowed, ignoring the file format and then the related interpreter
> > >   dependencies (e.g. ELF libraries, script’s shebang).
> > >
> > > >
> > > > Um, what security policy?
> > >
> > > Whether the file is allowed to be executed.  This includes file
> > > permission, mount point option, ACL, LSM policies...
> >
> > This needs *waaaaay* more detail for any sort of useful evaluation.
> > Is an actual credible security policy rolling dice?  Asking ChatGPT?
> > Looking at security labels?  Does it care who can write to the file,
> > or who owns the file, or what the file's hash is, or what filesystem
> > it's on, or where it came from?  Does it dynamically inspect the
> > contents?  Is it controlled by an unprivileged process?
>
> AT_EXECVE_CHECK only does the same checks as done by other execveat(2)
> calls, but without actually executing the file/fd.
>

okay... but see below.

> >
> > I can easily come up with security policies for which DENYWRITE is
> > completely useless.  I can come up with convoluted and
> > not-really-credible policies where DENYWRITE is important, but I'm
> > honestly not sure that those policies are actually useful.  I'm
> > honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
> > because it should have been parametrized by *what format is expected*
> > -- it might be possible to bypass a policy by executing a perfectly
> > fine Python script using bash, for example.
>
> There has been a lot of bikeshedding for the AT_EXECVE_CHECK patch
> series, and a lot of discussions too (you were part of them).  We ended
> up with this design, which is simple and follows the kernel semantic
> (requested by Linus).

I recall this.  That doesn't mean I totally love AT_EXECVE_CHECK.  And
it especially doesn't mean that I believe that it usefully does
something that justifies anything like DENYWRITE.

>
> >
> > I genuinely have not come up with a security policy that I believe
> > makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
> > that such a policy does not exist -- I'm saying that I have not
> > thought of such a thing after a few minutes of thought and reading
> > these threads.
>
> A simple use case is for systems that want to enforce a
> write-xor-execute policy e.g., thanks to mount point options.

Sure, but I'm contemplating DENYWRITE, and this thread is about
DENYWRITE.  If the kernel is enforcing W^X, then there are really two
almost unrelated things going on:

1. LSM policy that enforces W^X for memory mappings.  This is to
enforce that applications don't do nasty things like having executable
stacks, and it's a mess because no one has really figured out how JITs
are supposed to work in this world.  It has almost nothing to do with
execve except incidentally.

2. LSM policy that enforces that someone doesn't execve (or similar)
something that *that user* can write.  Or that non-root can write.  Or
that anyone at all can write, etc.

I think, but I'm not sure, that you're talking about #2.  So maybe
there's a policy that says that one may only exec things that are on
an fs with the 'exec' mount option.  Or maybe there's a policy that
says that one may only exec things that are on a readonly fs.  In
these specific cases, I believe in AT_EXECVE_CHECK.  *But* I don't
believe in DENYWRITE: in the 'exec' case, if an fs has the exec option
set, that doesn't change if the file is subsequently modified.  And if
an fs is readonly, then the file is quite unlikely to be modified at
all and will certainly not be modified via the mount through which
it's being executed.  And you don't need DENYWRITE.
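
The two mount-level policies mentioned here can be inspected from user space. A minimal sketch, assuming Linux and Python's os.statvfs(); the function name is hypothetical. It only reports properties of the mount, which is the point being made: neither flag changes when a file on that mount is rewritten.

```python
import os


def mount_exec_flags(path):
    """Return whether the file system holding path is mounted noexec and/or read-only."""
    flags = os.statvfs(path).f_flag
    return {
        "noexec": bool(flags & os.ST_NOEXEC),
        "readonly": bool(flags & os.ST_RDONLY),
    }
```

For example, mount_exec_flags("/tmp") reports the options of whatever file system /tmp lives on, regardless of who can write to individual files there.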

So I think my question still stands: is there a credible security
policy *that actually benefits from DENYWRITE*?  If so, can you give
an example?

> >
> > Seriously, consider all the unending recent attacks on LLMs an
> > inspiration.  The implications of viewing an image, downscaling the
> > image, possibly interpreting the image as something containing text,
> > possibly following instructions in a given language contained in the
> > image, etc are all wildly different.  A mechanism for asking for
> > general permission to "consume this image" is COMPLETELY MISSING THE
> > POINT.  (Never mind that the current crop of LLMs seem entirely
> > incapable of constraining their own use of some piece of input, but
> > that's a different issue and is beside the point here.)
>
> You're asking about what should we consider executable.  This is a good
> question, but AT_EXECVE_CHECK is there to answer another question: would
> the kernel execute it or not?
>

That's a sort of odd way of putting it.  The kernel won't execute it
because the kernel doesn't know how to :)  But I think I understand
what you're saying.

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Theodore Ts'o 5 months, 2 weeks ago

On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote:
> 
>   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
>   on a regular file and returns 0 if execution of this file would be
>   allowed, ignoring the file format and then the related interpreter
>   dependencies (e.g. ELF libraries, script’s shebang).

But if that's it, why can't the script interpreter (python, bash,
etc.), before executing the script, check for executability via
faccessat(2) or fstat(2)?

The whole O_DENY_WRITE discussion seemed to imply that AT_EXECVE_CHECK
was doing more than just the executability check?

> There is no other way for user space to reliably check executability of
> files (taking into account all enforced security
> policies/configurations).

Why doesn't faccessat(2) or fstat(2) suffice?  This is why having a
more substantive requirements and design doc might be helpful.  It
appears you have some assumptions that perhaps other kernel developers
are not aware of.  I certainly seem to be missing something.....

    		  	    	    - Ted

Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)

Posted by Mickaël Salaün 5 months, 2 weeks ago

On Tue, Aug 26, 2025 at 04:50:57PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote:
> > 
> >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> >   on a regular file and returns 0 if execution of this file would be
> >   allowed, ignoring the file format and then the related interpreter
> >   dependencies (e.g. ELF libraries, script’s shebang).
> 
> But if that's it, why can't the script interpreter (python, bash,
> etc.), before executing the script, check for executability via
> faccessat(2) or fstat(2)?

From commit a5874fde3c08 ("exec: Add a new AT_EXECVE_CHECK flag to
execveat(2)"):

    This is different from faccessat(2) + X_OK which only checks a subset of
    access rights (i.e. inode permission and mount options for regular
    files), but not the full context (e.g. all LSM access checks).  The main
    use case for access(2) is for SUID processes to (partially) check access
    on behalf of their caller.  The main use case for execveat(2) +
    AT_EXECVE_CHECK is to check if a script execution would be allowed,
    according to all the different restrictions in place.  Because the use
    of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
    execution, user space gets the same error codes.
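
To make the contrast concrete, here is a sketch of the faccessat(2)-style check, using Python's os.access() with X_OK on a temporary file. The function name is hypothetical. As the commit message says, this kind of check covers inode permission bits and mount options; it does not consult LSM policy.

```python
import os
import tempfile


def x_ok_before_and_after_chmod():
    """Report os.access(path, X_OK) for a file without and then with execute bits."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        os.chmod(path, 0o644)
        without_x = os.access(path, os.X_OK)
        os.chmod(path, 0o755)
        with_x = os.access(path, os.X_OK)   # may still be False on a noexec mount
    finally:
        os.unlink(path)
    return without_x, with_x
```

A True result here says nothing about whether an LSM would still refuse the execution, which is the gap AT_EXECVE_CHECK is described as closing.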


> 
> The whole O_DENY_WRITE discussion seemed to imply that AT_EXECVE_CHECK
> was doing more than just the executability check?

I would say that AT_EXECVE_CHECK does a full executability check
(with the full caller's credentials checked against the currently
enforced security policy).

The rationale to add O_DENY_WRITE (which is now abandoned) was to avoid a race
condition between the check and the full read.  Indeed, with a full
execveat(2), the kernel write-locks the file to avoid such an issue (which can
lead to other issues).

> 
> > There is no other way for user space to reliably check executability of
> > files (taking into account all enforced security
> > policies/configurations).
> 
> Why doesn't faccessat(2) or fstat(2) suffice?  This is why having a
> more substantive requirements and design doc might be helpful.  It
> appears you have some assumptions that perhaps other kernel developers
> are not aware of.  I certainly seem to be missing something.....

My reasoning was to explain the rationale for a kernel feature in the commit
message, and the user doc (why and how to use it) in the user-facing
documentation.  Documentation improvements are welcome!