[RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify

Cong Wang posted 3 patches 4 weeks ago
[RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
Posted by Cong Wang 4 weeks ago
From: Cong Wang <cwang@multikernel.io>

This is a complete rework of v1 (PIN_ARGS), reshaped to address the
review feedback that having every syscall-arg fetch site consult a
per-task pin pointer is cross-cutting awareness that does not scale.

v1 thread:
https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@gmail.com

## Changes since v1

The previous proposal (SECCOMP_IOCTL_NOTIF_PIN_ARGS) snapshotted
pointer-arg payloads into kernel buffers and modified several syscall
fetch sites (getname_flags in fs/namei.c, copy_strings in fs/exec.c,
move_addr_to_kernel in net/socket.c, import_ubuf in lib/iov_iter.c
plus new_sync_read/new_sync_write in fs/read_write.c) so the resumed
syscall body would consume from the snapshot instead of re-reading
user memory. The reviewer correctly pointed out that this spreads
"continue-from-snapshotted-state" awareness across the VFS and the
kernel in general, and that the right shape for this kind of feature
is one where the syscall layer does not have to care.

v2 inverts the model. The supervisor no longer pins args for a
resumed syscall body to consume; it describes a substitute syscall
(nr + args[6]) whose pointer-shaped args are encoded as byte offsets
into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
trapped task wakes inside seccomp_do_user_notification(), dispatches
into a kernel-mode syscall helper (filp_open / kernel_bind /
kernel_write for v1), and the helper's return value becomes the
trapped syscall's return value. The trapped task's user mm is never
re-read for the substituted syscall.

Footprint comparison:

  v1 (PIN_ARGS):  ~1000 LOC across kernel/, fs/, mm/, net/, lib/
  v2 (INJECT):    ~540  LOC, all in kernel/seccomp_inject.{c,h} plus
                  a small dispatcher in kernel/seccomp.c. fs/, mm/,
                  net/, lib/ are unmodified.

## Motivation (short version)

seccomp_unotify(2) leaves a TOCTOU window: a sibling thread or
CLONE_VM peer can mutate pointer-arg buffers between the supervisor's
process_vm_readv() and the kernel's re-read on
SECCOMP_USER_NOTIF_FLAG_CONTINUE. The race is documented in the man
page. ptrace and /proc/pid/mem are unavailable to unprivileged
supervisors, so today there is no race-free path for content-aware
allow policy on CONTINUE.

The full motivation, including the threat model (adversarial AI
agents in the same address space) and the concrete user (Sandlock,
https://github.com/multikernel/sandlock), is in the v1 cover letter
above.

## Approach

The UAPI is one new ioctl and one new NOTIF_SEND flag:

    struct seccomp_notif_inject {
        __u64 id;
        __u64 nr;
        __u64 args[6];
        __u64 buf;               /* __user, kernel-input bytes */
        __u32 buf_size;
        __u32 args_in_buf_mask;  /* bit i: args[i] is offset into buf */
    };

    #define SECCOMP_IOCTL_NOTIF_INJECT \
        SECCOMP_IOW(5, struct seccomp_notif_inject)
    #define SECCOMP_USER_NOTIF_FLAG_INJECTED (1U << 1)

The struct mirrors ptrace_syscall_info.entry (nr + args[6]); pointer
args are encoded as byte offsets into buf via args_in_buf_mask. The
kernel copies buf at INJECT time and acts on its kernel-side copy
thereafter. The supervisor commits by sending NOTIF_SEND with
FLAG_INJECTED.
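
As a rough illustration of the encoding, here is a minimal sketch of how a
supervisor might describe an openat whose path lives at offset 0 of buf.
The struct is a local mirror of the proposed layout (it is not in any
released header), pack_openat() is a hypothetical helper, and
sign-extending dirfd into args[0] is an assumption about how the kernel
would interpret it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the proposed UAPI struct; not in any released header. */
struct seccomp_notif_inject {
    uint64_t id;
    uint64_t nr;
    uint64_t args[6];
    uint64_t buf;              /* user pointer to the payload bytes */
    uint32_t buf_size;
    uint32_t args_in_buf_mask; /* bit i: args[i] is an offset into buf */
};

/*
 * Hypothetical helper: describe openat(dirfd, path, flags, mode) with
 * the path stored at offset 0 of buf. Returns 0 on success, -1 if the
 * path (including its NUL terminator) does not fit in buf.
 */
static int pack_openat(struct seccomp_notif_inject *inj, uint64_t id,
                       uint64_t nr_openat, int dirfd, const char *path,
                       int flags, unsigned int mode,
                       char *buf, size_t buf_cap)
{
    size_t len = strlen(path) + 1;

    assert(inj && buf);
    if (len > buf_cap || len > UINT32_MAX)
        return -1;

    memcpy(buf, path, len);
    memset(inj, 0, sizeof(*inj));
    inj->id = id;
    inj->nr = nr_openat;
    inj->args[0] = (uint64_t)(int64_t)dirfd;  /* assumption: sign-extended */
    inj->args[1] = 0;                         /* offset of path within buf */
    inj->args[2] = (uint64_t)(uint32_t)flags;
    inj->args[3] = mode;
    inj->buf = (uint64_t)(uintptr_t)buf;
    inj->buf_size = (uint32_t)len;
    inj->args_in_buf_mask = 1U << 1;          /* only args[1] is a pointer */
    return 0;
}
```

The supervisor would then pass the filled struct to
SECCOMP_IOCTL_NOTIF_INJECT and commit with NOTIF_SEND carrying
FLAG_INJECTED.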

v1 injectable-syscall whitelist:

  - openat (filp_open + fd_install)
  - bind   (sockfd_lookup + kernel_bind)
  - write  (kernel_write)

These cover the bulk of Sandlock-shape policies. execveat is deferred
to v2 follow-up (kernel_execveat in the trapped task's context needs
careful framing, and the argv/envp packed format deserves its own
review round). sendmsg/recvmsg are deferred because msghdr's nested
pointers (msg_name + msg_iov array + msg_control with possible
SCM_RIGHTS) do not fit the flat-buffer model without a richer
packed-format design.

Per-syscall plumbing is small: each injector is ~20 LOC. Adding a
new injectable syscall is a self-contained patch (one new entry in
seccomp_injector_for() plus a new injector function). The framework
does not need to know about new syscalls; only the per-syscall
injector parses buf for its own layout.
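
To make the "one table entry plus one function" shape concrete, here is a
userspace sketch of the lookup. It is not the code from the patch: the
injector signature, the stub bodies, and the table layout are assumptions,
and the syscall numbers are the x86_64 ones. Only the name of the lookup
and the -EOPNOTSUPP result for unsupported syscalls come from this series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical injector signature: each injector parses buf itself. */
typedef long (*injector_fn)(const uint64_t args[6],
                            const void *buf, uint32_t buf_size);

/* Stubs standing in for the filp_open/kernel_bind/kernel_write paths. */
static long inject_openat(const uint64_t a[6], const void *b, uint32_t n)
{ (void)a; (void)b; (void)n; return 0; }
static long inject_bind(const uint64_t a[6], const void *b, uint32_t n)
{ (void)a; (void)b; (void)n; return 0; }
static long inject_write(const uint64_t a[6], const void *b, uint32_t n)
{ (void)a; (void)b; (void)n; return 0; }

struct injector_entry {
    uint64_t nr;
    injector_fn fn;
};

/* x86_64 syscall numbers: write = 1, bind = 49, openat = 257. */
static const struct injector_entry injectors[] = {
    { 1,   inject_write  },
    { 49,  inject_bind   },
    { 257, inject_openat },
};

/* Return the injector for nr, or NULL if nr is not whitelisted. */
static injector_fn injector_for(uint64_t nr)
{
    for (size_t i = 0; i < sizeof(injectors) / sizeof(injectors[0]); i++)
        if (injectors[i].nr == nr)
            return injectors[i].fn;
    return NULL;
}

/* The framework only dispatches; it never looks inside buf. */
static long run_injected(uint64_t nr, const uint64_t args[6],
                         const void *buf, uint32_t buf_size)
{
    injector_fn fn = injector_for(nr);

    assert(args != NULL);
    if (!fn)
        return -EOPNOTSUPP;
    return fn(args, buf, buf_size);
}
```

Adding a syscall in this model means appending one row to injectors[]
and writing one more injector function.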

## Relation to ptrace

ptrace can already inject syscalls via PTRACE_SETREGSET +
PTRACE_POKEDATA. CRIU, gdb and most container tooling use this
pattern. INJECT does not add a new kernel capability. It provides a
listener-fd-gated, syscall-whitelisted, narrower interface to that
capability for unprivileged supervisors, where ptrace's privilege
model (CAP_SYS_PTRACE / Yama PR_SET_PTRACER) and per-syscall overhead
(signal-stop cycle plus O(N) peer-thread coordination required for
race-free reads) are not viable.

SECCOMP_IOCTL_NOTIF_ADDFD set the precedent for this kind of narrow
listener-fd interface to a ptrace-overlapping capability. INJECT is
ADDFD generalized along one more axis: not just fd substitution, but
substitute-with-kernel-validated-args for a handful of syscalls
where Sandlock-shape policies need it.

## Why this closes TOCTOU end-to-end

For pointer args placed in buf, the kernel reads from kernel memory
that the supervisor populated at INJECT time. The agent's mm and
buf are decoupled after that ioctl returns: peers can mutate user
memory all they want, but the kernel never re-reads it for the
substituted syscall. The supervisor commits bytes; the kernel
performs the operation; the trapped task gets the result.

The remaining race (peer mutates agent mm before the supervisor's
process_vm_readv) is a userspace consistency concern, not a kernel
bypass: the supervisor decides on bytes it observed, and the kernel
performs the operation the supervisor authorized. The peer cannot
cause the kernel to act on bytes the supervisor never inspected.

## Lifecycle, bounds, accounting

  - One-shot: the inject record attaches to the knotif on INJECT and
    is consumed by NOTIF_SEND with FLAG_INJECTED. It is freed on
    listener close, task exit, supervisor changing its mind (CONTINUE
    or plain deny), or SIGKILL of the trapped task.
  - 1 MiB hardcoded cap on buf_size (defensive ceiling, not a tunable).
  - GFP_KERNEL_ACCOUNT so the trapped task's memcg pays.
  - The substitute nr must match the trapped syscall's number (no
    confused-deputy conversion from one syscall family to another).

## Testing

The selftest binary covers openat, bind and write end-to-end (child
issues syscall with one set of args; supervisor injects substitute;
verify kernel acted on supervisor-supplied bytes), plus negative
paths: unsupported syscall (-EOPNOTSUPP), trapped/inject syscall
mismatch (-ESRCH), double INJECT (-EEXIST), CONTINUE+INJECTED
conflict (-EINVAL), INJECTED without prior attach (-EINVAL). All
of these cases pass on x86_64.

---
Cong Wang (3):
  seccomp: add SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
  selftests/seccomp: add seccomp_notif_inject coverage
  Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_INJECT

 .../userspace-api/seccomp_filter.rst          |  42 ++
 MAINTAINERS                                   |   2 +
 include/uapi/linux/seccomp.h                  |  65 +++
 kernel/Makefile                               |   1 +
 kernel/seccomp.c                              | 121 ++++-
 kernel/seccomp_inject.c                       | 281 ++++++++++++
 kernel/seccomp_inject.h                       |  65 +++
 tools/testing/selftests/seccomp/.gitignore    |   1 +
 tools/testing/selftests/seccomp/Makefile      |   2 +-
 .../selftests/seccomp/seccomp_notif_inject.c  | 434 ++++++++++++++++++
 10 files changed, 1011 insertions(+), 3 deletions(-)
 create mode 100644 kernel/seccomp_inject.c
 create mode 100644 kernel/seccomp_inject.h
 create mode 100644 tools/testing/selftests/seccomp/seccomp_notif_inject.c

--
2.43.0
Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
Posted by Andy Lutomirski 2 weeks, 2 days ago
On Thu, May 14, 2026 at 9:27 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> From: Cong Wang <cwang@multikernel.io>
>
> This is a complete rework of v1 (PIN_ARGS), reshaped to address the
> review feedback that having every syscall-arg fetch site consult a
> per-task pin pointer is cross-cutting awareness that does not scale.
>
> v1 thread:
> https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@gmail.com
>
> ## Changes since v1

Here are some thoughts:

>
> v2 inverts the model. The supervisor no longer pins args for a
> resumed syscall body to consume; it describes a substitute syscall
> (nr + args[6]) whose pointer-shaped args are encoded as byte offsets
> into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
> trapped task wakes inside seccomp_do_user_notification(), dispatches
> into a kernel-mode syscall helper (filp_open / kernel_bind /
> kernel_write for v1), and the helper's return value becomes the
> trapped syscall's return value. The trapped task's user mm is never
> re-read for the substituted syscall.

This sounds like it could be done well or it could be done poorly.
Doing it poorly sounds like it would resemble set_fs(), and set_fs()
was awful.  Please don't reintroduce it or anything like it.

Doing it well sounds like introducing a bunch of new entrypoints.  In
some sense this seems like a nice plan, except that essentially every
syscall doesn't work like that.  So getting any sort of decent
coverage could involve extensive kernel changes and might involve
adding lots of new entrypoints that are only used for this new system,
which isn't great.

> The full motivation, including the threat model (adversarial AI
> agents in the same address space) and the concrete user (Sandlock,
> https://github.com/multikernel/sandlock), is in the v1 cover letter
> above.

Whoa there.  "adversarial AI agents" aren't a threat model that makes
sense in this context.  I think you mean "multiple tasks, all running
untrusted code, potentially sharing an address space".


But here are some other thoughts:

> v1 injectable-syscall whitelist:
>
>   - openat (filp_open + fd_install)
>   - bind   (sockfd_lookup + kernel_bind)
>   - write  (kernel_write)

How gnarly would an actual API for this be?  By "actual API" I mean an
fd that represents complete control over a target task (which the
existing seccomp fd sort of is) and syscalls issued against that fd
that do openat, bind, read, write, etc.

Or... what if there was a nice way to create a pinned mapping (and
verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
supervisor owns.  Then the supervisor could write syscall args into it
and re-point pointers into it.

Also, for actual correct ABI compatibility, by the time the syscall's
caller is resumed, the original arguments except the return value
should be restored, because those registers are caller-saved at least
on x86.

--Andy
Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
Posted by Cong Wang 2 weeks ago
Hi Andy,

On Tue, May 26, 2026 at 12:03 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Thu, May 14, 2026 at 9:27 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > From: Cong Wang <cwang@multikernel.io>
> >
> > This is a complete rework of v1 (PIN_ARGS), reshaped to address the
> > review feedback that having every syscall-arg fetch site consult a
> > per-task pin pointer is cross-cutting awareness that does not scale.
> >
> > v1 thread:
> > https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@gmail.com
> >
> > ## Changes since v1
>
> Here are some thoughts:
>
> >
> > v2 inverts the model. The supervisor no longer pins args for a
> > resumed syscall body to consume; it describes a substitute syscall
> > (nr + args[6]) whose pointer-shaped args are encoded as byte offsets
> > into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
> > trapped task wakes inside seccomp_do_user_notification(), dispatches
> > into a kernel-mode syscall helper (filp_open / kernel_bind /
> > kernel_write for v1), and the helper's return value becomes the
> > trapped syscall's return value. The trapped task's user mm is never
> > re-read for the substituted syscall.
>
> This sounds like it could be done well or it could be done poorly.
> Doing it poorly sounds like it would resemble set_fs(), and set_fs()
> was awful.  Please don't reintroduce it or anything like it.

Noted.

For v2, the injectors call filp_open(), kernel_bind(), kernel_write(),
kernel-pointer entrypoints that exist specifically so kernel callers
don't have to spoof user-space.

>
> Doing it well sounds like introducing a bunch of new entrypoints.  In
> some sense this seems like a nice plan, except that essentially every
> syscall doesn't work like that.  So getting any sort of decent
> coverage could involve extensive kernel changes and might involve
> adding lots of new entrypoints that are only used for this new system,
> which isn't great.

This is a valid concern.

Currently, we have only 3 entrypoints, all of which have pre-existing
kernel-side APIs. In the future, we may need to extend the list, for
example, for execve().

However, the number of syscalls we inject is still very small compared
with the total number of syscalls on Linux. I don't anticipate this list
growing beyond 10, since most syscalls don't have TOCTOU issues at all.

>
> > The full motivation, including the threat model (adversarial AI
> > agents in the same address space) and the concrete user (Sandlock,
> > https://github.com/multikernel/sandlock), is in the v1 cover letter
> > above.
>
> Whoa there.  "adversarial AI agents" aren't a threat model that makes
> sense in this context.  I think you mean "multiple tasks, all running
> untrusted code, potentially sharing an address space".

Right, I will update the wording.

>
>
> But here are some other thoughts:
>
> > v1 injectable-syscall whitelist:
> >
> >   - openat (filp_open + fd_install)
> >   - bind   (sockfd_lookup + kernel_bind)
> >   - write  (kernel_write)
>
> How gnarly would an actual API for this be?  By "actual API" I mean an
> fd that represents complete control over a target task (which the
> existing seccomp fd sort of is) and syscalls issued against that fd
> that do openat, bind, read, write, etc.

Excellent suggestion! How about the following API?

    ioctl(lfd, SECCOMP_IOCTL_NOTIF_RECV, &req);
    /* req.data.nr == __NR_openat; args[1] is target's pointer. */

    read_target_string(req.pid, req.data.args[1], path, sizeof(path));

    if (policy_allows(path)) {
        struct seccomp_notif_target_call call = {
            .id   = req.id,
            .nr   = __NR_openat,
            .args = { AT_FDCWD, (uintptr_t)path, O_RDONLY, 0, 0, 0 },
        };
        ioctl(lfd, SECCOMP_IOCTL_NOTIF_TARGET_CALL, &call);

        struct seccomp_notif_resp resp = {
            .id    = req.id,
            .val   = call.ret >= 0 ? call.ret : 0,
            .error = call.ret >= 0 ? 0 : (int)call.ret,
        };
        ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
    } else {
        struct seccomp_notif_resp resp = {
            .id = req.id, .error = -EACCES,
        };
        ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
    }


>
> Or... what if there was a nice way to create a pinned mapping (and
> verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
> supervisor owns.  Then the supervisor could write syscall args into it
> and re-point pointers into it.

One concrete deployment constraint worth surfacing: Sandlock (and
similar wrappers: Firejail, Bubblewrap-style sandboxes) work by
fork+execve of arbitrary target binaries. The pinned-memfd
approach needs the seal installed in a trusted window, but
execve() replaces the address space, so anything mapped pre-exec
is lost. The window between execve and the first instruction of
the untrusted binary belongs to the dynamic loader (or to nothing
at all for static binaries), not to the supervisor.

>
> Also, for actual correct ABI compatibility, by the time the syscall's
> caller is resumed, the original arguments except the return value
> should be restored, because those registers are caller-saved at least
> on x86.

The current implementation already complies.

Thanks!
Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
Posted by Andy Lutomirski 2 weeks ago
On Thu, May 28, 2026 at 10:42 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi Andy,
>
> On Tue, May 26, 2026 at 12:03 PM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > On Thu, May 14, 2026 at 9:27 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > From: Cong Wang <cwang@multikernel.io>
> > >
> > > This is a complete rework of v1 (PIN_ARGS), reshaped to address the
> > > review feedback that having every syscall-arg fetch site consult a
> > > per-task pin pointer is cross-cutting awareness that does not scale.
> > >
> > > v1 thread:
> > > https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@gmail.com
> > >
> > > ## Changes since v1
> >
> > Here are some thoughts:
> >
> > >
> > > v2 inverts the model. The supervisor no longer pins args for a
> > > resumed syscall body to consume; it describes a substitute syscall
> > > (nr + args[6]) whose pointer-shaped args are encoded as byte offsets
> > > into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
> > > trapped task wakes inside seccomp_do_user_notification(), dispatches
> > > into a kernel-mode syscall helper (filp_open / kernel_bind /
> > > kernel_write for v1), and the helper's return value becomes the
> > > trapped syscall's return value. The trapped task's user mm is never
> > > re-read for the substituted syscall.
> >
> > This sounds like it could be done well or it could be done poorly.
> > Doing it poorly sounds like it would resemble set_fs(), and set_fs()
> > was awful.  Please don't reintroduce it or anything like it.
>
> Noted.
>
> For v2, the injectors call filp_open(), kernel_bind(), kernel_write(),
> kernel-pointer entrypoints that exist specifically so kernel callers
> don't have to spoof user-space.
>
> >
> > Doing it well sounds like introducing a bunch of new entrypoints.  In
> > some sense this seems like a nice plan, except that essentially every
> > syscall doesn't work like that.  So getting any sort of decent
> > coverage could involve extensive kernel changes and might involve
> > adding lots of new entrypoints that are only used for this new system,
> > which isn't great.
>
> This is a valid concern.
>
> Currently, we have only 3 entrypoints, all of which have pre-existing
> kernel-side APIs. In the future, we may need to extend the list, for
> example, for execve().
>
> However, the number of syscalls we inject is still very small compared
> with the total number of syscalls on Linux. I don't anticipate this list
> growing beyond 10, since most syscalls don't have TOCTOU issues at all.
>
> >
> > > The full motivation, including the threat model (adversarial AI
> > > agents in the same address space) and the concrete user (Sandlock,
> > > https://github.com/multikernel/sandlock), is in the v1 cover letter
> > > above.
> >
> > Whoa there.  "adversarial AI agents" aren't a threat model that makes
> > sense in this context.  I think you mean "multiple tasks, all running
> > untrusted code, potentially sharing an address space".
>
> Right, I will update the wording.
>
> >
> >
> > But here are some other thoughts:
> >
> > > v1 injectable-syscall whitelist:
> > >
> > >   - openat (filp_open + fd_install)
> > >   - bind   (sockfd_lookup + kernel_bind)
> > >   - write  (kernel_write)
> >
> > How gnarly would an actual API for this be?  By "actual API" I mean an
> > fd that represents complete control over a target task (which the
> > existing seccomp fd sort of is) and syscalls issued against that fd
> > that do openat, bind, read, write, etc.
>
> Excellent suggestion! How about the following API?
>
>     ioctl(lfd, SECCOMP_IOCTL_NOTIF_RECV, &req);
>     /* req.data.nr == __NR_openat; args[1] is target's pointer. */
>
>     read_target_string(req.pid, req.data.args[1], path, sizeof(path));
>
>     if (policy_allows(path)) {
>         struct seccomp_notif_target_call call = {
>             .id   = req.id,
>             .nr   = __NR_openat,
>             .args = { AT_FDCWD, (uintptr_t)path, O_RDONLY, 0, 0, 0 },
>         };
>         ioctl(lfd, SECCOMP_IOCTL_NOTIF_TARGET_CALL, &call);
>
>         struct seccomp_notif_resp resp = {
>             .id    = req.id,
>             .val   = call.ret >= 0 ? call.ret : 0,
>             .error = call.ret >= 0 ? 0 : (int)call.ret,
>         };
>         ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>     } else {
>         struct seccomp_notif_resp resp = {
>             .id = req.id, .error = -EACCES,
>         };
>         ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>     }
>
>
> >
> > Or... what if there was a nice way to create a pinned mapping (and
> > verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
> > supervisor owns.  Then the supervisor could write syscall args into it
> > and re-point pointers into it.
>
> One concrete deployment constraint worth surfacing: Sandlock (and
> similar wrappers: Firejail, Bubblewrap-style sandboxes) work by
> fork+execve of arbitrary target binaries. The pinned-memfd
> approach needs the seal installed in a trusted window, but
> execve() replaces the address space, so anything mapped pre-exec
> is lost. The window between execve and the first instruction of
> the untrusted binary belongs to the dynamic loader (or to nothing
> at all for static binaries), not to the supervisor.

I don't think this matters.  A good implementation would have the
seccomp ioctl interface (or a new syscall or whatever) be able to set
up the new pinned mapping without any particular cooperation from the
target process.  So the process would start, and it would run freely
until its first syscall, and then you would install the pinned region.
And you would get notified on execve (ideally via a new notification
telling you, specifically, that the address space got cleared) so that
you know that the pinned region is gone (and that no other threads are
running concurrently in that address space!).

And (see below) you still want a way to redirect syscall args such
that they get un-redirected on return.

>
> >
> > Also, for actual correct ABI compatibility, by the time the syscall's
> > caller is resumed, the original arguments except the return value
> > should be restored, because those registers are caller-saved at least
> > on x86.
>
> The current implementation already complies.

Of course it compiles.

But someone out there probably has code that does something like:

void *param = foo;
some_syscall(param);
something_else(param);

On x86_64, param goes in RDI.  Now psABI does *not* say that RDI is
preserved on return from some_syscall, so you *think* that the
compiler will reload RDI prior to calling something_else.  But
syscalls don't obey psABI, and people love to inline them, and I bet
there are programs out there that used inline asm or a compiler that
doesn't target psABI (or a compiler that does but that, with
increasing use of LTO and such, can analyze some_syscall and determine
that it's inline asm inside) and they've set their constraints such
that RDI is *not* clobbered, and the generated code resembles:

SYSCALL
CALL something_else

or

CALL some_syscall_wrapper
CALL something_else

and it works!  Or at least it works as long as syscall restart isn't
hitting one of its excessively weird cases here, which it usually
isn't.  And then they run it under seccomp and it fails because now
RDI really is clobbered.  And it's absolutely miserable to debug.  And
it needs a kernel patch to fix because you don't have a clean,
performant fix in your seccomp user code.

--Andy
Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
Posted by Cong Wang 2 weeks ago
On Thu, May 28, 2026 at 11:15 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> On Thu, May 28, 2026 at 10:42 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Hi Andy,
> >
> > On Tue, May 26, 2026 at 12:03 PM Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > Or... what if there was a nice way to create a pinned mapping (and
> > > verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
> > > supervisor owns.  Then the supervisor could write syscall args into it
> > > and re-point pointers into it.
> >
> > One concrete deployment constraint worth surfacing: Sandlock (and
> > similar wrappers: Firejail, Bubblewrap-style sandboxes) work by
> > fork+execve of arbitrary target binaries. The pinned-memfd
> > approach needs the seal installed in a trusted window, but
> > execve() replaces the address space, so anything mapped pre-exec
> > is lost. The window between execve and the first instruction of
> > the untrusted binary belongs to the dynamic loader (or to nothing
> > at all for static binaries), not to the supervisor.
>
> I don't think this matters.  A good implementation would have the
> seccomp ioctl interface (or a new syscall or whatever) be able to set
> up the new pinned mapping without any particular cooperation from the
> target process.  So the process would start, and it would run freely
> until its first syscall, and then you would install the pinned region.
> And you would get notified on execve (ideally via a new notification
> telling you, specifically, that the address space got cleared) so that
> you know that the pinned region is gone (and that no other threads are
> running concurrently in that address space!).

You are right. I thought it would be hard to implement this non-cooperative
pinned memfd, but it turns out to be much easier than I expected.

Please let me know your thoughts on the following design:

  /* 1. Supervisor receives a trap. */
  ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, &req);

  /* 2. Install a sealed pin in the trapped task's mm. */
  struct seccomp_notif_pin_install pin = {
      .id          = req.id,
      .memfd       = my_memfd,
      .target_addr = PIN_ADDR,
      .size        = PIN_SIZE,
  };
  ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, &pin);

  /* 3. Write the substitute arg into the pin via our own memfd view. */
  strcpy(sup_view, "/dev/null");

  /* 4. Redirect args[1] into the pin and resume the syscall. */
  struct seccomp_notif_resp_redirect redir = {
      .id        = req.id,
      .flags     = SECCOMP_REDIRECT_FLAG_CONTINUE,
      .args_mask = 1U << 1,
      .ptr_mask  = 1U << 1,
      .args      = { 0, PIN_ADDR, 0, 0, 0, 0 },
  };
  ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT, &redir);


>
> And (see below) you still want a way to redirect syscall args such
> that they get un-redirected on return.
>
> >
> > >
> > > Also, for actual correct ABI compatibility, by the time the syscall's
> > > caller is resumed, the original arguments except the return value
> > > should be restored, because those registers are caller-saved at least
> > > on x86.
> >
> > The current implementation already complies.
>
> Of course it compiles.
>
> But someone out there probably has code that does something like:
>
> void *param = foo;
> some_syscall(param);
> something_else(param);
>
> On x86_64, param goes in RDI.  Now psABI does *not* say that RDI is
> preserved on return from some_syscall, so you *think* that the
> compiler will reload RDI prior to calling something_else.  But
> syscalls don't obey psABI, and people love to inline them, and I bet
> there are programs out there that used inline asm or a compiler that
> doesn't target psABI (or a compiler that does but that, with
> increasing use of LTO and such, can analyze some_syscall and determine
> that it's inline asm inside) and they've set their constraints such
> that RDI is *not* clobbered, and the generated code resembles:

The pinned-memfd design above handles this directly. SEND_REDIRECT
saves the trapped task's original arg registers into the knotif before
calling syscall_set_arguments() with the supervisor's substituted
values, and queues a task_work via task_work_add(TWA_RESUME). The
callback fires at the user-mode boundary in
syscall_exit_to_user_mode_work, before control returns to userspace,
and rewrites the masked positions back to the saved originals via
syscall_set_arguments(). The caller observes its original register
contents when it resumes.
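
A userspace model of that save/substitute/restore sequence, as a sketch
only: the regs[] array stands in for the trapped task's six syscall
argument registers, and redirect_state, redirect_apply() and
redirect_restore() are hypothetical names. In the kernel the two steps
would go through syscall_set_arguments(), with the restore running from
the task_work callback.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_SYSCALL_ARGS 6

/* Hypothetical per-notification state holding the original args. */
struct redirect_state {
    uint64_t saved[NR_SYSCALL_ARGS];
    uint32_t args_mask;
};

/* Save the originals, then overwrite only the masked positions. */
static void redirect_apply(struct redirect_state *st,
                           uint64_t regs[NR_SYSCALL_ARGS],
                           const uint64_t subst[NR_SYSCALL_ARGS],
                           uint32_t args_mask)
{
    assert(st && regs && subst);

    memcpy(st->saved, regs, sizeof(st->saved));
    st->args_mask = args_mask & ((1U << NR_SYSCALL_ARGS) - 1);

    for (int i = 0; i < NR_SYSCALL_ARGS; i++) {
        if (st->args_mask & (1U << i))
            regs[i] = subst[i];
    }
}

/* Put the masked positions back before the caller resumes. */
static void redirect_restore(const struct redirect_state *st,
                             uint64_t regs[NR_SYSCALL_ARGS])
{
    assert(st && regs);

    for (int i = 0; i < NR_SYSCALL_ARGS; i++) {
        if (st->args_mask & (1U << i))
            regs[i] = st->saved[i];
    }
}
```

Unmasked positions are never touched in either direction, so only the
redirected args are rewritten on the way back to userspace.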

Thanks!
Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
Posted by Cong Wang 2 weeks, 2 days ago
Hi,

On Thu, May 14, 2026 at 9:27 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> From: Cong Wang <cwang@multikernel.io>
>
> This is a complete rework of v1 (PIN_ARGS), reshaped to address the
> review feedback that having every syscall-arg fetch site consult a
> per-task pin pointer is cross-cutting awareness that does not scale.
>
> v1 thread:
> https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@gmail.com
>
> ## Changes since v1
>
> The previous proposal (SECCOMP_IOCTL_NOTIF_PIN_ARGS) snapshotted
> pointer-arg payloads into kernel buffers and modified several syscall
> fetch sites (getname_flags in fs/namei.c, copy_strings in fs/exec.c,
> move_addr_to_kernel in net/socket.c, import_ubuf in lib/iov_iter.c
> plus new_sync_read/new_sync_write in fs/read_write.c) so the resumed
> syscall body would consume from the snapshot instead of re-reading
> user memory. The reviewer correctly pointed out that this spreads
> "continue-from-snapshotted-state" awareness across the VFS and the
> kernel in general, and that the right shape for this kind of feature
> is one where the syscall layer does not have to care.
>
> v2 inverts the model. The supervisor no longer pins args for a
> resumed syscall body to consume; it describes a substitute syscall
> (nr + args[6]) whose pointer-shaped args are encoded as byte offsets
> into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
> trapped task wakes inside seccomp_do_user_notification(), dispatches
> into a kernel-mode syscall helper (filp_open / kernel_bind /
> kernel_write for v1), and the helper's return value becomes the
> trapped syscall's return value. The trapped task's user mm is never
> re-read for the substituted syscall.

Please let me know your thoughts on this v2 design. I would like to
get feedback before removing the RFC tag.

Thanks!