exec: add spawn templates for repeated executable startup

[RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Li Chen 1 month, 2 weeks ago

Hi,

This is an early RFC for an idea that is probably still rough in both the
UAPI and implementation details. Sorry for the rough edges; I am sending
it now to check whether this direction is worth pursuing and to get
feedback on the kernel/userspace boundary.

The series is based on linux-next version 20260518.

This RFC adds spawn_template, a userspace-controlled exec acceleration
mechanism for runtimes that repeatedly start the same executable with
different argv, envp, and per-spawn file descriptor setup.

The main target is agent runtimes. Modern coding agents repeatedly start
short-lived helper tools such as rg, git, sed, awk, python, node, and
shell wrappers while they inspect and edit a workspace. Those runtimes
already know which tools are hot, and they are also the right place to
decide policy. The kernel does not choose names such as rg, git, or sed.
Userspace opts in by creating a template fd for one executable, then uses
that fd for later spawns. Launchers, shells, and build systems have a
similar repeated-startup shape and could use the same primitive, but the
agent runtime case is the main motivation for this RFC.

The mechanism applies to the executable that userspace asks the kernel to
start. If an agent runtime directly starts /usr/bin/rg, the rg executable
is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
head", the shell is the template target unless the shell itself opts in
when it starts rg and head. The kernel does not parse the shell command
string or rewrite inner commands into template spawns. Userspace has to
call spawn_template for those inner commands explicitly:

    direct exec                 shell wrapper
    -----------                 -------------
    agent                       agent
      template("/usr/bin/rg")     template("/usr/bin/bash")
      spawn rg argv              spawn bash -c "rg ... | head"

    kernel target: rg          kernel target: bash
    rg startup benefits        rg/head need shell opt-in

Several agent runtime discussions are moving toward direct argv-style
exec tools for both security and policy clarity. For example, opencode
issue #2206 proposes an exec tool as a safer alternative to a shell-only
bash tool:

https://github.com/anomalyco/opencode/issues/2206

spawn_template is meant to support both models. Direct exec users can
cache the actual hot tool. Shell-wrapper users can cache the shell and
still reduce shell startup cost. If a shell or an agent runtime later
uses the same API for commands started inside a shell command, those
inner tools can benefit too.

Each spawn still goes through the normal exec path. The template reuses
only metadata that can be revalidated before use. Credential preparation,
permission checks, binary handler checks, secure-exec handling, and LSM
hooks remain on the normal execve path.

The UAPI has two operations. spawn_template_create() creates an
anonymous-inode template fd from either an executable fd or an absolute
executable path. spawn_template_spawn() starts one child from that
template, applies per-spawn fd, cwd, and signal actions, and returns both
pid and pidfd.

fd inheritance is deliberately conservative. By default, after the
requested per-spawn actions have run, the child closes fds above stderr.
An agent runtime can still request traditional inheritance explicitly,
but helper tools do not inherit unrelated secret files or sockets by
accident. The create-time actions fields are reserved and rejected in
this RFC because fd numbers are per-process state, not stable reusable
objects. The caller supplies fd actions for each spawn instead.

A typical agent runtime would keep one template per hot executable and
still build argv, envp, cwd, and pipe wiring for each tool call:

    rg_tmpl = spawn_template_create("/usr/bin/rg");

    for each search request:
        out_r, out_w = pipe_cloexec();
        err_r, err_w = pipe_cloexec();
        actions = [
            FCHDIR(worktree_fd),
            DUP2(out_w, STDOUT_FILENO),
            DUP2(err_w, STDERR_FILENO),
        ];
        child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
        close(out_w);
        close(err_w);
        read out_r and err_r;
        waitid(P_PIDFD, child.pidfd, ...);

A shell-wrapper runtime would use the same shape with a template for
/usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
reduces shell startup cost, but it does not cache rg or head inside that
command unless the shell also opts into spawn_template for commands it
starts internally.

The template pins the executable and denies writes to that file while the
template fd is alive, so cached executable metadata cannot race with a
writer changing the same inode. This means direct in-place writes to the
executable can fail while a runtime keeps a template open. It does not
block the common package-manager update pattern where a new inode is
written and then atomically renamed over the old path. In that case the
old path-created template becomes stale, spawn_template_spawn() rejects
it with ESTALE, and the runtime should close and recreate the template
for the new executable.

    in-place write              package-manager update
    --------------              ----------------------
    template pins old inode     write new inode
    write(old inode) denied     rename(new, "/usr/bin/rg")

    cached metadata safe        old template sees path mismatch
                                spawn_template_spawn() = -ESTALE
                                recreate template for new inode

Each spawn revalidates executable identity before cached metadata is
used. Path-created templates only accept absolute paths: a relative path
such as ./tool depends on cwd, and the same string can name a different
file after chdir. For an absolute path template, each spawn reopens the
path and checks that it still resolves to the executable recorded when
the template was created. If the path now names a replaced file, the
template is stale and userspace should close and recreate it.

A template fd can be passed over SCM_RIGHTS like any other fd, but this
RFC does not treat that as delegation. spawn_template_spawn() only works
while the caller still has the same struct cred object that created the
template. If another task, or the same task after a credential change,
receives the fd, spawn fails instead of running the executable using the
creator's launch authority:

    ordinary fd                         spawn_template fd
    -----------                         -----------------
    A: open log                         A: create rg template
    A -> B: SCM_RIGHTS(fd)              A -> B: SCM_RIGHTS(tfd)

    B: read(fd) = ok                    B: spawn(tfd) = -EACCES
                                        B: create own rg template
                                        B: spawn(own_tfd) = ok

    open-file use is delegated          spawn authority is not delegated

The cached state is intentionally small. The template fd keeps the opened
main executable file, an optional absolute path string, the creator
credential pointer, and the deny-write state. The executable identity key
records device, inode, size, mode, owner, ctime, and mtime, and is
rechecked before cached metadata is used. The ELF cache keeps only the
main executable's ELF header, program header table, and program header
count.

    cached in this RFC          not cached in this RFC
    ------------------          ----------------------
    opened main executable      PT_INTERP metadata
    executable identity key     shared-library graph
    main ELF header             VMA layout metadata
    main ELF program headers    cross-process metadata sharing
    creator cred pointer
    deny-write state

This RFC does not cache ELF interpreter metadata, shared-library
dependency state, or derived mapping-layout state. Shared-library
resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
state. It also does not share cached executable metadata between template
fds created by different processes. Each template owns its small cached
metadata object in this RFC.

Performance
===========

The numbers below come from my separate local autogen-bench project.
autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
fans out concurrent tool-call requests to worker agents. The workload
definitions, generated test files, and subprocess/spawn_template backends
are local to autogen-bench.

The agent-tools preset includes direct tool calls and shell-wrapper forms
for:

rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
python-small, node-small, sh-c, and bash-c.

The benchmark is launch-heavy but not no-op: it searches generated
Python-like source files, reads sample files, runs small Python and
Node.js programs, and runs git status and git diff in a small repository.
It does not include model inference or long-running tool work, so the
numbers mainly describe the short-tool regime.

The subprocess column starts each tool call through the existing
userspace launch path. The spawn_template column creates templates for
hot executables and uses spawn_template_spawn() for later calls.

Total in-flight tool calls stay at 16; only the worker-process split
changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
calls each. The two time_s values are subprocess/spawn_template wall
times.

Workload     Calls  subprocess  spawn_template  time_s       Delta
(workers)    calls  calls/s     calls/s         seconds
1x16         6144      411.04          420.32   14.95/14.62  +2.26%
2x8          6144      666.78          690.08    9.21/8.90   +3.49%
4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%

The table measures the whole mixed workload, including both process
startup and the short tool work done after exec. Since this workload is
launch-heavy, the possible launch-side savings include:

- the template fd keeps an opened executable, avoiding repeated ordinary
  open/path setup for that executable;
- the kernel can reuse cached main-executable ELF header and program
  header metadata after revalidation;
- the fork-and-exec-style launch is submitted as one
  spawn_template_spawn() operation;
- fd, cwd, and signal actions run in the child kernel path instead of
  being driven one syscall at a time by userspace child glue;
- pid and pidfd are returned by the same operation, reducing some
  runtime-side bookkeeping.

In local experiments before this RFC, I also tried caching ELF
interpreter metadata and derived ELF mapping-layout metadata. A focused
repeated-exec benchmark did not show a stable standalone throughput gain
for those two optimizations, so this RFC leaves them out and keeps only
the main executable metadata cache.

I also tried sharing main-executable ELF metadata across template fds
created by different processes for the same executable identity. That can
reduce duplicated metadata memory when many agent worker processes create
their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
it did not show a stable throughput win in local multi-agent tests. It
also adds cache keying, lifetime, invalidation, credential, and namespace
questions to the RFC. This version therefore keeps per-template metadata
ownership and leaves cross-process sharing out.

Sorry again for the rough edges in this RFC. I would appreciate feedback
on whether this direction is useful and what the right API boundary
should be.

Thanks,
Li

[1]: https://github.com/microsoft/autogen

Li Chen (13):
  exec: factor argument setup out of do_execveat_common()
  exec: add an internal helper for opened executables
  file: expose helpers for in-kernel fd actions
  exec: add spawn template UAPI definitions
  exec: add spawn template file descriptors
  exec: add spawn_template_spawn()
  exec: validate spawn template executable identity
  binfmt_elf: cache ELF metadata for spawn templates
  Documentation: describe spawn templates
  exec: require absolute paths for path-created templates
  exec: let close-range actions target the max fd
  syscalls: add generic spawn template entries
  selftests/exec: cover spawn template basics

 Documentation/userspace-api/index.rst         |   1 +
 .../userspace-api/spawn_template.rst          | 153 +++
 MAINTAINERS                                   |   6 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   3 +-
 fs/Makefile                                   |   2 +-
 fs/binfmt_elf.c                               | 104 +-
 fs/exec.c                                     | 162 ++-
 fs/file.c                                     |  11 +-
 fs/spawn_template.c                           | 619 +++++++++++
 include/linux/binfmts.h                       |  10 +
 include/linux/fdtable.h                       |   2 +
 include/linux/spawn_template.h                |  72 ++
 include/linux/syscalls.h                      |   7 +
 include/uapi/asm-generic/unistd.h             |   7 +-
 include/uapi/linux/spawn_template.h           |  62 ++
 scripts/syscall.tbl                           |   2 +
 tools/testing/selftests/exec/Makefile         |   1 +
 tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
 18 files changed, 2179 insertions(+), 42 deletions(-)
 create mode 100644 Documentation/userspace-api/spawn_template.rst
 create mode 100644 fs/spawn_template.c
 create mode 100644 include/linux/spawn_template.h
 create mode 100644 include/uapi/linux/spawn_template.h
 create mode 100644 tools/testing/selftests/exec/spawn_template.c

-- 
2.52.0

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Gabriel Krisman Bertazi 1 month, 1 week ago

Li Chen <me@linux.beauty> writes:

> Hi,
>
> This is an early RFC for an idea that is probably still rough in both the
> UAPI and implementation details. Sorry for the rough edges; I am sending
> it now to check whether this direction is worth pursuing and to get
> feedback on the kernel/userspace boundary.
>
> The series is based on linux-next version 20260518.
>
> This RFC adds spawn_template, a userspace-controlled exec acceleration
> mechanism for runtimes that repeatedly start the same executable with
> different argv, envp, and per-spawn file descriptor setup.

Have you looked at Josh's proposal to do this over io_uring [1] and my
implementation of it at [2]?  I think io_uring is a very natural
interface for something like this, it will avoid adding a larger API,
since you could, in theory, set up the entire new task context using
regular io_uring operations in an io workqueue and then starting it would
be a matter of forking the pre-configured io thread with a new io_uring
operation.

[1]
https://lpc.events/event/16/contributions/1213/attachments/1012/1945/io-uring-spawn.pdf
[2] https://lwn.net/Articles/1001622/

>
> The main target is agent runtimes. Modern coding agents repeatedly start
> short-lived helper tools such as rg, git, sed, awk, python, node, and
> shell wrappers while they inspect and edit a workspace. Those runtimes
> already know which tools are hot, and they are also the right place to
> decide policy. The kernel does not choose names such as rg, git, or sed.
> Userspace opts in by creating a template fd for one executable, then uses
> that fd for later spawns. Launchers, shells, and build systems have a
> similar repeated-startup shape and could use the same primitive, but the
> agent runtime case is the main motivation for this RFC.
>
> The mechanism applies to the executable that userspace asks the kernel to
> start. If an agent runtime directly starts /usr/bin/rg, the rg executable
> is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
> head", the shell is the template target unless the shell itself opts in
> when it starts rg and head. The kernel does not parse the shell command
> string or rewrite inner commands into template spawns. Userspace has to
> call spawn_template for those inner commands explicitly:
>
>     direct exec                 shell wrapper
>     -----------                 -------------
>     agent                       agent
>       template("/usr/bin/rg")     template("/usr/bin/bash")
>       spawn rg argv              spawn bash -c "rg ... | head"
>
>     kernel target: rg          kernel target: bash
>     rg startup benefits        rg/head need shell opt-in
>
> Several agent runtime discussions are moving toward direct argv-style
> exec tools for both security and policy clarity. For example, opencode
> issue #2206 proposes an exec tool as a safer alternative to a shell-only
> bash tool:
>
> https://github.com/anomalyco/opencode/issues/2206
>
> spawn_template is meant to support both models. Direct exec users can
> cache the actual hot tool. Shell-wrapper users can cache the shell and
> still reduce shell startup cost. If a shell or an agent runtime later
> uses the same API for commands started inside a shell command, those
> inner tools can benefit too.
>
> Each spawn still goes through the normal exec path. The template reuses
> only metadata that can be revalidated before use. Credential preparation,
> permission checks, binary handler checks, secure-exec handling, and LSM
> hooks remain on the normal execve path.
>
> The UAPI has two operations. spawn_template_create() creates an
> anonymous-inode template fd from either an executable fd or an absolute
> executable path. spawn_template_spawn() starts one child from that
> template, applies per-spawn fd, cwd, and signal actions, and returns both
> pid and pidfd.
>
> fd inheritance is deliberately conservative. By default, after the
> requested per-spawn actions have run, the child closes fds above stderr.
> An agent runtime can still request traditional inheritance explicitly,
> but helper tools do not inherit unrelated secret files or sockets by
> accident. The create-time actions fields are reserved and rejected in
> this RFC because fd numbers are per-process state, not stable reusable
> objects. The caller supplies fd actions for each spawn instead.
>
> A typical agent runtime would keep one template per hot executable and
> still build argv, envp, cwd, and pipe wiring for each tool call:
>
>     rg_tmpl = spawn_template_create("/usr/bin/rg");
>
>     for each search request:
>         out_r, out_w = pipe_cloexec();
>         err_r, err_w = pipe_cloexec();
>         actions = [
>             FCHDIR(worktree_fd),
>             DUP2(out_w, STDOUT_FILENO),
>             DUP2(err_w, STDERR_FILENO),
>         ];
>         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
>         close(out_w);
>         close(err_w);
>         read out_r and err_r;
>         waitid(P_PIDFD, child.pidfd, ...);
>
> A shell-wrapper runtime would use the same shape with a template for
> /usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
> reduces shell startup cost, but it does not cache rg or head inside that
> command unless the shell also opts into spawn_template for commands it
> starts internally.
>
> The template pins the executable and denies writes to that file while the
> template fd is alive, so cached executable metadata cannot race with a
> writer changing the same inode. This means direct in-place writes to the
> executable can fail while a runtime keeps a template open. It does not
> block the common package-manager update pattern where a new inode is
> written and then atomically renamed over the old path. In that case the
> old path-created template becomes stale, spawn_template_spawn() rejects
> it with ESTALE, and the runtime should close and recreate the template
> for the new executable.
>
>     in-place write              package-manager update
>     --------------              ----------------------
>     template pins old inode     write new inode
>     write(old inode) denied     rename(new, "/usr/bin/rg")
>
>     cached metadata safe        old template sees path mismatch
>                                 spawn_template_spawn() = -ESTALE
>                                 recreate template for new inode
>
> Each spawn revalidates executable identity before cached metadata is
> used. Path-created templates only accept absolute paths: a relative path
> such as ./tool depends on cwd, and the same string can name a different
> file after chdir. For an absolute path template, each spawn reopens the
> path and checks that it still resolves to the executable recorded when
> the template was created. If the path now names a replaced file, the
> template is stale and userspace should close and recreate it.
>
> A template fd can be passed over SCM_RIGHTS like any other fd, but this
> RFC does not treat that as delegation. spawn_template_spawn() only works
> while the caller still has the same struct cred object that created the
> template. If another task, or the same task after a credential change,
> receives the fd, spawn fails instead of running the executable using the
> creator's launch authority:
>
>     ordinary fd                         spawn_template fd
>     -----------                         -----------------
>     A: open log                         A: create rg template
>     A -> B: SCM_RIGHTS(fd)              A -> B: SCM_RIGHTS(tfd)
>
>     B: read(fd) = ok                    B: spawn(tfd) = -EACCES
>                                         B: create own rg template
>                                         B: spawn(own_tfd) = ok
>
>     open-file use is delegated          spawn authority is not delegated
>
> The cached state is intentionally small. The template fd keeps the opened
> main executable file, an optional absolute path string, the creator
> credential pointer, and the deny-write state. The executable identity key
> records device, inode, size, mode, owner, ctime, and mtime, and is
> rechecked before cached metadata is used. The ELF cache keeps only the
> main executable's ELF header, program header table, and program header
> count.
>
>     cached in this RFC          not cached in this RFC
>     ------------------          ----------------------
>     opened main executable      PT_INTERP metadata
>     executable identity key     shared-library graph
>     main ELF header             VMA layout metadata
>     main ELF program headers    cross-process metadata sharing
>     creator cred pointer
>     deny-write state
>
> This RFC does not cache ELF interpreter metadata, shared-library
> dependency state, or derived mapping-layout state. Shared-library
> resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
> RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
> state. It also does not share cached executable metadata between template
> fds created by different processes. Each template owns its small cached
> metadata object in this RFC.
>
> Performance
> ===========
>
> The numbers below come from my separate local autogen-bench project.
> autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
> instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
> fans out concurrent tool-call requests to worker agents. The workload
> definitions, generated test files, and subprocess/spawn_template backends
> are local to autogen-bench.
>
> The agent-tools preset includes direct tool calls and shell-wrapper forms
> for:
>
> rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
> python-small, node-small, sh-c, and bash-c.
>
> The benchmark is launch-heavy but not no-op: it searches generated
> Python-like source files, reads sample files, runs small Python and
> Node.js programs, and runs git status and git diff in a small repository.
> It does not include model inference or long-running tool work, so the
> numbers mainly describe the short-tool regime.
>
> The subprocess column starts each tool call through the existing
> userspace launch path. The spawn_template column creates templates for
> hot executables and uses spawn_template_spawn() for later calls.
>
> Total in-flight tool calls stay at 16; only the worker-process split
> changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
> calls each. The two time_s values are subprocess/spawn_template wall
> times.
>
> Workload     Calls  subprocess  spawn_template  time_s       Delta
> (workers)    calls  calls/s     calls/s         seconds
> 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
> 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
> 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
> 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
>
> The table measures the whole mixed workload, including both process
> startup and the short tool work done after exec. Since this workload is
> launch-heavy, the possible launch-side savings include:
>
> - the template fd keeps an opened executable, avoiding repeated ordinary
>   open/path setup for that executable;
> - the kernel can reuse cached main-executable ELF header and program
>   header metadata after revalidation;
> - the fork-and-exec-style launch is submitted as one
>   spawn_template_spawn() operation;
> - fd, cwd, and signal actions run in the child kernel path instead of
>   being driven one syscall at a time by userspace child glue;
> - pid and pidfd are returned by the same operation, reducing some
>   runtime-side bookkeeping.
>
> In local experiments before this RFC, I also tried caching ELF
> interpreter metadata and derived ELF mapping-layout metadata. A focused
> repeated-exec benchmark did not show a stable standalone throughput gain
> for those two optimizations, so this RFC leaves them out and keeps only
> the main executable metadata cache.
>
> I also tried sharing main-executable ELF metadata across template fds
> created by different processes for the same executable identity. That can
> reduce duplicated metadata memory when many agent worker processes create
> their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
> it did not show a stable throughput win in local multi-agent tests. It
> also adds cache keying, lifetime, invalidation, credential, and namespace
> questions to the RFC. This version therefore keeps per-template metadata
> ownership and leaves cross-process sharing out.
>
> Sorry again for the rough edges in this RFC. I would appreciate feedback
> on whether this direction is useful and what the right API boundary
> should be.
>
> Thanks,
> Li
>
> [1]: https://github.com/microsoft/autogen
>
> Li Chen (13):
>   exec: factor argument setup out of do_execveat_common()
>   exec: add an internal helper for opened executables
>   file: expose helpers for in-kernel fd actions
>   exec: add spawn template UAPI definitions
>   exec: add spawn template file descriptors
>   exec: add spawn_template_spawn()
>   exec: validate spawn template executable identity
>   binfmt_elf: cache ELF metadata for spawn templates
>   Documentation: describe spawn templates
>   exec: require absolute paths for path-created templates
>   exec: let close-range actions target the max fd
>   syscalls: add generic spawn template entries
>   selftests/exec: cover spawn template basics
>
>  Documentation/userspace-api/index.rst         |   1 +
>  .../userspace-api/spawn_template.rst          | 153 +++
>  MAINTAINERS                                   |   6 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   3 +-
>  fs/Makefile                                   |   2 +-
>  fs/binfmt_elf.c                               | 104 +-
>  fs/exec.c                                     | 162 ++-
>  fs/file.c                                     |  11 +-
>  fs/spawn_template.c                           | 619 +++++++++++
>  include/linux/binfmts.h                       |  10 +
>  include/linux/fdtable.h                       |   2 +
>  include/linux/spawn_template.h                |  72 ++
>  include/linux/syscalls.h                      |   7 +
>  include/uapi/asm-generic/unistd.h             |   7 +-
>  include/uapi/linux/spawn_template.h           |  62 ++
>  scripts/syscall.tbl                           |   2 +
>  tools/testing/selftests/exec/Makefile         |   1 +
>  tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
>  18 files changed, 2179 insertions(+), 42 deletions(-)
>  create mode 100644 Documentation/userspace-api/spawn_template.rst
>  create mode 100644 fs/spawn_template.c
>  create mode 100644 include/linux/spawn_template.h
>  create mode 100644 include/uapi/linux/spawn_template.h
>  create mode 100644 tools/testing/selftests/exec/spawn_template.c

-- 
Gabriel Krisman Bertazi

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Li Chen 1 month, 1 week ago

Hi Gabriel,

Yes, I looked at Josh's slides and your RFC a few days ago.

I agree that io_uring is a very interesting direction, and I can see why it
fits the "ordered setup operations before exec" model.

My current preference is still to first explore a pidfd/pidfs-based builder,
modeled roughly like fsconfig(). Process creation feels like a core process
lifecycle API, and I think a normal fd-based syscall interface may be easier
for libc, language runtimes, shells,and sandboxing tools to adopt.

My hesitation is practical rather than conceptual.Some important
deployments still disable io_uring entirely; Docker's default seccomp
profile blocks the io_uring syscalls, and Google has disabled or restricted
io_uring in ChromeOS, Android app processes, and production servers.

I will study your io_uring work more carefully and compare the two directions.
One possible outcome is that io_uring can drive/share the same builder object later;
I do not know that yet.

Thanks for pointing this out.

 ---- On Fri, 05 Jun 2026 22:24:00 +0800  Gabriel Krisman Bertazi <krisman@suse.de> wrote --- 
 > Li Chen <me@linux.beauty> writes:
 > 
 > > Hi,
 > >
 > > This is an early RFC for an idea that is probably still rough in both the
 > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > it now to check whether this direction is worth pursuing and to get
 > > feedback on the kernel/userspace boundary.
 > >
 > > The series is based on linux-next version 20260518.
 > >
 > > This RFC adds spawn_template, a userspace-controlled exec acceleration
 > > mechanism for runtimes that repeatedly start the same executable with
 > > different argv, envp, and per-spawn file descriptor setup.
 > 
 > Have you looked at Josh's proposal to do this over io_uring [1] and my
 > implementation of it at [2]?  I think io_uring is a very natural
 > interface for something like this, it will avoid adding a larger API,
 > since you could, in theory, set up the entire new task context using
 > regular io_uring operations in an io workqueue and then starting it would
 > be a matter of forking the pre-configured io thread with a new io_uring
 > operation.
 > 
 > [1]
 > https://lpc.events/event/16/contributions/1213/attachments/1012/1945/io-uring-spawn.pdf
 > [2] https://lwn.net/Articles/1001622/
 > 
 > >
 > > The main target is agent runtimes. Modern coding agents repeatedly start
 > > short-lived helper tools such as rg, git, sed, awk, python, node, and
 > > shell wrappers while they inspect and edit a workspace. Those runtimes
 > > already know which tools are hot, and they are also the right place to
 > > decide policy. The kernel does not choose names such as rg, git, or sed.
 > > Userspace opts in by creating a template fd for one executable, then uses
 > > that fd for later spawns. Launchers, shells, and build systems have a
 > > similar repeated-startup shape and could use the same primitive, but the
 > > agent runtime case is the main motivation for this RFC.
 > >
 > > The mechanism applies to the executable that userspace asks the kernel to
 > > start. If an agent runtime directly starts /usr/bin/rg, the rg executable
 > > is the template target. If the runtime starts /usr/bin/bash -c "rg ... |
 > > head", the shell is the template target unless the shell itself opts in
 > > when it starts rg and head. The kernel does not parse the shell command
 > > string or rewrite inner commands into template spawns. Userspace has to
 > > call spawn_template for those inner commands explicitly:
 > >
 > >     direct exec                 shell wrapper
 > >     -----------                 -------------
 > >     agent                       agent
 > >       template("/usr/bin/rg")     template("/usr/bin/bash")
 > >       spawn rg argv              spawn bash -c "rg ... | head"
 > >
 > >     kernel target: rg          kernel target: bash
 > >     rg startup benefits        rg/head need shell opt-in
 > >
 > > Several agent runtime discussions are moving toward direct argv-style
 > > exec tools for both security and policy clarity. For example, opencode
 > > issue #2206 proposes an exec tool as a safer alternative to a shell-only
 > > bash tool:
 > >
 > > https://github.com/anomalyco/opencode/issues/2206
 > >
 > > spawn_template is meant to support both models. Direct exec users can
 > > cache the actual hot tool. Shell-wrapper users can cache the shell and
 > > still reduce shell startup cost. If a shell or an agent runtime later
 > > uses the same API for commands started inside a shell command, those
 > > inner tools can benefit too.
 > >
 > > Each spawn still goes through the normal exec path. The template reuses
 > > only metadata that can be revalidated before use. Credential preparation,
 > > permission checks, binary handler checks, secure-exec handling, and LSM
 > > hooks remain on the normal execve path.
 > >
 > > The UAPI has two operations. spawn_template_create() creates an
 > > anonymous-inode template fd from either an executable fd or an absolute
 > > executable path. spawn_template_spawn() starts one child from that
 > > template, applies per-spawn fd, cwd, and signal actions, and returns both
 > > pid and pidfd.
 > >
 > > fd inheritance is deliberately conservative. By default, after the
 > > requested per-spawn actions have run, the child closes fds above stderr.
 > > An agent runtime can still request traditional inheritance explicitly,
 > > but helper tools do not inherit unrelated secret files or sockets by
 > > accident. The create-time actions fields are reserved and rejected in
 > > this RFC because fd numbers are per-process state, not stable reusable
 > > objects. The caller supplies fd actions for each spawn instead.
 > >
 > > A typical agent runtime would keep one template per hot executable and
 > > still build argv, envp, cwd, and pipe wiring for each tool call:
 > >
 > >     rg_tmpl = spawn_template_create("/usr/bin/rg");
 > >
 > >     for each search request:
 > >         out_r, out_w = pipe_cloexec();
 > >         err_r, err_w = pipe_cloexec();
 > >         actions = [
 > >             FCHDIR(worktree_fd),
 > >             DUP2(out_w, STDOUT_FILENO),
 > >             DUP2(err_w, STDERR_FILENO),
 > >         ];
 > >         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
 > >         close(out_w);
 > >         close(err_w);
 > >         read out_r and err_r;
 > >         waitid(P_PIDFD, child.pidfd, ...);
 > >
 > > A shell-wrapper runtime would use the same shape with a template for
 > > /usr/bin/bash and argv such as ["/usr/bin/bash", "-c", command]. That
 > > reduces shell startup cost, but it does not cache rg or head inside that
 > > command unless the shell also opts into spawn_template for commands it
 > > starts internally.
 > >
 > > The template pins the executable and denies writes to that file while the
 > > template fd is alive, so cached executable metadata cannot race with a
 > > writer changing the same inode. This means direct in-place writes to the
 > > executable can fail while a runtime keeps a template open. It does not
 > > block the common package-manager update pattern where a new inode is
 > > written and then atomically renamed over the old path. In that case the
 > > old path-created template becomes stale, spawn_template_spawn() rejects
 > > it with ESTALE, and the runtime should close and recreate the template
 > > for the new executable.
 > >
 > >     in-place write              package-manager update
 > >     --------------              ----------------------
 > >     template pins old inode     write new inode
 > >     write(old inode) denied     rename(new, "/usr/bin/rg")
 > >
 > >     cached metadata safe        old template sees path mismatch
 > >                                 spawn_template_spawn() = -ESTALE
 > >                                 recreate template for new inode
 > >
 > > Each spawn revalidates executable identity before cached metadata is
 > > used. Path-created templates only accept absolute paths: a relative path
 > > such as ./tool depends on cwd, and the same string can name a different
 > > file after chdir. For an absolute path template, each spawn reopens the
 > > path and checks that it still resolves to the executable recorded when
 > > the template was created. If the path now names a replaced file, the
 > > template is stale and userspace should close and recreate it.
 > >
 > > A template fd can be passed over SCM_RIGHTS like any other fd, but this
 > > RFC does not treat that as delegation. spawn_template_spawn() only works
 > > while the caller still has the same struct cred object that created the
 > > template. If another task, or the same task after a credential change,
 > > receives the fd, spawn fails instead of running the executable using the
 > > creator's launch authority:
 > >
 > >     ordinary fd                         spawn_template fd
 > >     -----------                         -----------------
 > >     A: open log                         A: create rg template
 > >     A -> B: SCM_RIGHTS(fd)              A -> B: SCM_RIGHTS(tfd)
 > >
 > >     B: read(fd) = ok                    B: spawn(tfd) = -EACCES
 > >                                         B: create own rg template
 > >                                         B: spawn(own_tfd) = ok
 > >
 > >     open-file use is delegated          spawn authority is not delegated
 > >
 > > The cached state is intentionally small. The template fd keeps the opened
 > > main executable file, an optional absolute path string, the creator
 > > credential pointer, and the deny-write state. The executable identity key
 > > records device, inode, size, mode, owner, ctime, and mtime, and is
 > > rechecked before cached metadata is used. The ELF cache keeps only the
 > > main executable's ELF header, program header table, and program header
 > > count.
 > >
 > >     cached in this RFC          not cached in this RFC
 > >     ------------------          ----------------------
 > >     opened main executable      PT_INTERP metadata
 > >     executable identity key     shared-library graph
 > >     main ELF header             VMA layout metadata
 > >     main ELF program headers    cross-process metadata sharing
 > >     creator cred pointer
 > >     deny-write state
 > >
 > > This RFC does not cache ELF interpreter metadata, shared-library
 > > dependency state, or derived mapping-layout state. Shared-library
 > > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
 > > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
 > > state. It also does not share cached executable metadata between template
 > > fds created by different processes. Each template owns its small cached
 > > metadata object in this RFC.
 > >
 > > Performance
 > > ===========
 > >
 > > The numbers below come from my separate local autogen-bench project.
 > > autogen-bench uses AutoGen [1] Core as the agent harness: RoutedAgent
 > > instances run under SingleThreadedAgentRuntime, and RPC-style dispatch
 > > fans out concurrent tool-call requests to worker agents. The workload
 > > definitions, generated test files, and subprocess/spawn_template backends
 > > are local to autogen-bench.
 > >
 > > The agent-tools preset includes direct tool calls and shell-wrapper forms
 > > for:
 > >
 > > rg, grep, sed, awk, cat, head, tail, find, stat, ls, git-status, git-diff,
 > > python-small, node-small, sh-c, and bash-c.
 > >
 > > The benchmark is launch-heavy but not no-op: it searches generated
 > > Python-like source files, reads sample files, runs small Python and
 > > Node.js programs, and runs git status and git diff in a small repository.
 > > It does not include model inference or long-running tool work, so the
 > > numbers mainly describe the short-tool regime.
 > >
 > > The subprocess column starts each tool call through the existing
 > > userspace launch path. The spawn_template column creates templates for
 > > hot executables and uses spawn_template_spawn() for later calls.
 > >
 > > Total in-flight tool calls stay at 16; only the worker-process split
 > > changes. For example, 4x4 means 4 worker processes with 4 in-flight tool
 > > calls each. The two time_s values are subprocess/spawn_template wall
 > > times.
 > >
 > > Workload     Calls  subprocess  spawn_template  time_s       Delta
 > > (workers)    calls  calls/s     calls/s         seconds
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > >
 > > The table measures the whole mixed workload, including both process
 > > startup and the short tool work done after exec. Since this workload is
 > > launch-heavy, the possible launch-side savings include:
 > >
 > > - the template fd keeps an opened executable, avoiding repeated ordinary
 > >   open/path setup for that executable;
 > > - the kernel can reuse cached main-executable ELF header and program
 > >   header metadata after revalidation;
 > > - the fork-and-exec-style launch is submitted as one
 > >   spawn_template_spawn() operation;
 > > - fd, cwd, and signal actions run in the child kernel path instead of
 > >   being driven one syscall at a time by userspace child glue;
 > > - pid and pidfd are returned by the same operation, reducing some
 > >   runtime-side bookkeeping.
 > >
 > > In local experiments before this RFC, I also tried caching ELF
 > > interpreter metadata and derived ELF mapping-layout metadata. A focused
 > > repeated-exec benchmark did not show a stable standalone throughput gain
 > > for those two optimizations, so this RFC leaves them out and keeps only
 > > the main executable metadata cache.
 > >
 > > I also tried sharing main-executable ELF metadata across template fds
 > > created by different processes for the same executable identity. That can
 > > reduce duplicated metadata memory when many agent worker processes create
 > > their own templates for /usr/bin/rg, /usr/bin/git, and similar tools, but
 > > it did not show a stable throughput win in local multi-agent tests. It
 > > also adds cache keying, lifetime, invalidation, credential, and namespace
 > > questions to the RFC. This version therefore keeps per-template metadata
 > > ownership and leaves cross-process sharing out.
 > >
 > > Sorry again for the rough edges in this RFC. I would appreciate feedback
 > > on whether this direction is useful and what the right API boundary
 > > should be.
 > >
 > > Thanks,
 > > Li
 > >
 > > [1]: https://github.com/microsoft/autogen
 > >
 > > Li Chen (13):
 > >   exec: factor argument setup out of do_execveat_common()
 > >   exec: add an internal helper for opened executables
 > >   file: expose helpers for in-kernel fd actions
 > >   exec: add spawn template UAPI definitions
 > >   exec: add spawn template file descriptors
 > >   exec: add spawn_template_spawn()
 > >   exec: validate spawn template executable identity
 > >   binfmt_elf: cache ELF metadata for spawn templates
 > >   Documentation: describe spawn templates
 > >   exec: require absolute paths for path-created templates
 > >   exec: let close-range actions target the max fd
 > >   syscalls: add generic spawn template entries
 > >   selftests/exec: cover spawn template basics
 > >
 > >  Documentation/userspace-api/index.rst         |   1 +
 > >  .../userspace-api/spawn_template.rst          | 153 +++
 > >  MAINTAINERS                                   |   6 +
 > >  arch/x86/entry/syscalls/syscall_64.tbl        |   3 +-
 > >  fs/Makefile                                   |   2 +-
 > >  fs/binfmt_elf.c                               | 104 +-
 > >  fs/exec.c                                     | 162 ++-
 > >  fs/file.c                                     |  11 +-
 > >  fs/spawn_template.c                           | 619 +++++++++++
 > >  include/linux/binfmts.h                       |  10 +
 > >  include/linux/fdtable.h                       |   2 +
 > >  include/linux/spawn_template.h                |  72 ++
 > >  include/linux/syscalls.h                      |   7 +
 > >  include/uapi/asm-generic/unistd.h             |   7 +-
 > >  include/uapi/linux/spawn_template.h           |  62 ++
 > >  scripts/syscall.tbl                           |   2 +
 > >  tools/testing/selftests/exec/Makefile         |   1 +
 > >  tools/testing/selftests/exec/spawn_template.c | 997 ++++++++++++++++++
 > >  18 files changed, 2179 insertions(+), 42 deletions(-)
 > >  create mode 100644 Documentation/userspace-api/spawn_template.rst
 > >  create mode 100644 fs/spawn_template.c
 > >  create mode 100644 include/linux/spawn_template.h
 > >  create mode 100644 include/uapi/linux/spawn_template.h
 > >  create mode 100644 tools/testing/selftests/exec/spawn_template.c
 > 
 > -- 
 > Gabriel Krisman Bertazi
 > 

Regards,
Li

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Andy Lutomirski 1 month, 2 weeks ago

On Thu, May 28, 2026 at 2:55 AM Li Chen <me@linux.beauty> wrote:
>

>
> The template pins the executable and denies writes to that file while the
> template fd is alive,

Please don't.  *Maybe* detect when it gets modified and clear your cache.

Or develop a generic way to open a new fd that's an immutable view
into an existing file such that the fd retains its contents even if
the file changes.  (Think a reflink that's not persistent and has no
name -- you'll need some way to avoid resource exhaustion.)

>
> Workload     Calls  subprocess  spawn_template  time_s       Delta
> (workers)    calls  calls/s     calls/s         seconds
> 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
> 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
> 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
> 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%

This is a lot of complexity in the kernel for a teeny tiny gain.

I'm with Christian -- a better spawn API would be great (and much
faster than fork/vfork + exec), but that's a different patch.

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Li Chen 1 month, 2 weeks ago

Hi Andy,

 ---- On Fri, 29 May 2026 02:27:00 +0800  Andy Lutomirski <luto@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 2:55 AM Li Chen <me@linux.beauty> wrote:
 > >
 > 
 > >
 > > The template pins the executable and denies writes to that file while the
 > > template fd is alive,
 > 
 > Please don't.  *Maybe* detect when it gets modified and clear your cache.
 > 
 > Or develop a generic way to open a new fd that's an immutable view
 > into an existing file such that the fd retains its contents even if
 > the file changes.  (Think a reflink that's not persistent and has no
 > name -- you'll need some way to avoid resource exhaustion.)

 I agree that deny-write is not a good long-term invalidation model. I had
 considered clear-cache-on-modify, but kept this RFC smaller.

 > >
 > > Workload     Calls  subprocess  spawn_template  time_s       Delta
 > > (workers)    calls  calls/s     calls/s         seconds
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > 
 > This is a lot of complexity in the kernel for a teeny tiny gain.
 > 
 > I'm with Christian -- a better spawn API would be great (and much
 > faster than fork/vfork + exec), but that's a different patch.
 
 Thanks, I agree. A pidfd/pidfs spawn builder looks like the much better API shape.

 The cover letter numbers were from a mixed agent-tool workload. For very short
 single-tool runs I saw larger wins, about +14% for printf-style work.
 I should have called that out separately.

 I will work toward a pidfd_config-style builder next.

Regards,

Li

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Mateusz Guzik 1 month, 2 weeks ago

On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> This RFC adds spawn_template, a userspace-controlled exec acceleration
> mechanism for runtimes that repeatedly start the same executable with
> different argv, envp, and per-spawn file descriptor setup.
> 
> The main target is agent runtimes. Modern coding agents repeatedly start
> short-lived helper tools such as rg, git, sed, awk, python, node, and
> shell wrappers while they inspect and edit a workspace. Those runtimes
> already know which tools are hot, and they are also the right place to
> decide policy. The kernel does not choose names such as rg, git, or sed.
> Userspace opts in by creating a template fd for one executable, then uses
> that fd for later spawns. Launchers, shells, and build systems have a
> similar repeated-startup shape and could use the same primitive, but the
> agent runtime case is the main motivation for this RFC.
> 
[..]
> A typical agent runtime would keep one template per hot executable and
> still build argv, envp, cwd, and pipe wiring for each tool call:
> 
>     rg_tmpl = spawn_template_create("/usr/bin/rg");
> 
>     for each search request:
>         out_r, out_w = pipe_cloexec();
>         err_r, err_w = pipe_cloexec();
>         actions = [
>             FCHDIR(worktree_fd),
>             DUP2(out_w, STDOUT_FILENO),
>             DUP2(err_w, STDERR_FILENO),
>         ];
>         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
>         close(out_w);
>         close(err_w);
>         read out_r and err_r;
>         waitid(P_PIDFD, child.pidfd, ...);
> 
> 
[..]
> The cached state is intentionally small. The template fd keeps the opened
> main executable file, an optional absolute path string, the creator
> credential pointer, and the deny-write state. The executable identity key
> records device, inode, size, mode, owner, ctime, and mtime, and is
> rechecked before cached metadata is used. The ELF cache keeps only the
> main executable's ELF header, program header table, and program header
> count.
> 
>     cached in this RFC          not cached in this RFC
>     ------------------          ----------------------
>     opened main executable      PT_INTERP metadata
>     executable identity key     shared-library graph
>     main ELF header             VMA layout metadata
>     main ELF program headers    cross-process metadata sharing
>     creator cred pointer
>     deny-write state
> 
> This RFC does not cache ELF interpreter metadata, shared-library
> dependency state, or derived mapping-layout state. Shared-library
> resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
> RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
> state. It also does not share cached executable metadata between template
> fds created by different processes. Each template owns its small cached
> metadata object in this RFC.
> 
> Performance
> ===========
> 
[..]
> Workload     Calls  subprocess  spawn_template  time_s       Delta
> (workers)    calls  calls/s     calls/s         seconds
> 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
> 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
> 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
> 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
> 

This problem is dear to my heart and I have been pondering it on and off
for some time now. The entire fork + exec idiom is terrible and needs to
be retired.

Is this vibe-coded? I asked claude for in-kernel posix_spawn for kicks
some time ago and it generated remarkably similar code. But that's a
tangent.

I'm rather confused by the angle in the patchset. Most of this shaves
off a tiny amount of work, while retaining the primary avoidable reason
for bad performance: the very fact that fork is part of the picture,
especially the part mucking with mm. Creating a pristine process is the
way to go.

Additionally there is a known problem where transiently copied file
descriptors on fork + exec cause a headache in multithreaded programs
doing something like this in parallel. I only did cursory reading, it
seems your patchset keeps the same problem in place.

There are numerous impactful ways to speed up execs both in terms of
single-threaded cost and their multicore scalability, most of which
would be immediately usable by all programs without an opt-in. imo these
needs to be exhausted before something like a "template" can be
considered.

Per the above, the primary win would stem from *NOT* messing with mm.

As in, whatever the interface, it needs to create an "empty" target
process (for lack of a better term).

In terms of userspace-visible APIs, a clean solution escapes me.

Some time ago I proposed returning a handle which is populated over time
by the parnet-to-be. One of the problems with it I failed to consider at
the time is NUMA locality -- what if the process to be created is going
to run on another domain? For example, opening and installing a file for
its later use will result in avoidable loss of locality for some of the
in-kernel data. That's on top of the fd vs fork problem.

From perf standpoint, the final goal of whatever mechanism should be a
state where the target process avoided copying any state it did not need
to and which allocated any memory it needed from local NUMA node
(whatever it may happen to be). Of course if no affinity is assigned it
may happen to move again and lose such locality, nothing can be done
about that. But pretend the process is to run in a specific node the
parent is NOT running in.

So I think the pragmatic way forward is to implement something close to
posix_spawn in the kernel. It may make sense for the thing to take the
PATH argument for repeated exec attempts. I understand this is of no use
in your particular case, but it very much IS of use for most of the
real-world. The initial implementation might even start with doing vfork
just to get it off the ground.

The next step would be to extend the interface with means to AVOID
copying any file descriptors. There could be a dedicated file action
which tells the kernel to avoid such copies or something like a
close_range file action (or close_from) -- with a range like <0, INT_MAX>
you know no fds are copied.

For the NUMA angle to be sorted out, any file action which opens a file
or dups from the parent needs to execute in the child. And frankly
something would be needed to ask the scheduler where does it think the
child is going to run, so that the task_struct itself can also be
allocated with the right backing.

I have not looked into what's needed to create a new process and NOT
mess with mm, but I don't think there are unsolvable problems there, at
worst some churn.

There are of course other parameters which need to be sorted out, that's
covered by the posix_spawn thing.

This e-mail is long enough, so I'm not going to go into issues
concerning exec itself right now.

tl;dr I would suggest redoing the patchset as posix_spawn and then doing
the actual optimization of not cloning mm itself.

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Jann Horn 1 month, 1 week ago

On Thu, May 28, 2026 at 2:55 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> This problem is dear to my heart and I have been pondering it on and off
> for some time now. The entire fork + exec idiom is terrible and needs to
> be retired.

It seems to me like vfork+exec is a decent UAPI building block, on
which you can build nice-looking userspace APIs, though I agree that
this is not an ideal direct interface for application code.

> Additionally there is a known problem where transiently copied file
> descriptors on fork + exec cause a headache in multithreaded programs
> doing something like this in parallel. I only did cursory reading, it
> seems your patchset keeps the same problem in place.

I think we almost have UAPI that would let you avoid this issue?
You can use clone() with CLONE_FILES, then unshare the FD table with
close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE). That is not currently
implemented to be atomic with stuff that happens on other threads, but
if we changed that, and it doesn't provide a good way to carry some
FDs across, but it feels to me like this could be fixed with a variant
of close_range() that removes O_CLOEXEC FDs except ones listed in an
array.

> There are numerous impactful ways to speed up execs both in terms of
> single-threaded cost and their multicore scalability, most of which
> would be immediately usable by all programs without an opt-in. imo these
> needs to be exhausted before something like a "template" can be
> considered.

(I think probably a large part of this would be stuff that happens in
userspace, like dynamic linking.)

> Per the above, the primary win would stem from *NOT* messing with mm.

As you write below, I think we have that with CLONE_MM? The C function
vfork() is kind of a terrible API because of its returns-twice
behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
wrapped by libc in a way similar to clone() (with the child executing
a separate handler function), or if it was used in the implementation
of some higher-level process-spawning API, it would be a perfectly
fine API?

Or am I misunderstanding what you mean by "messing with mm"?

> As in, whatever the interface, it needs to create an "empty" target
> process (for lack of a better term).
>
> In terms of userspace-visible APIs, a clean solution escapes me.

I think we already have relatively good API for this - you can use
clone() to create something that initially shares almost all the state
that a thread would, and then incrementally unshare resources and go
through execve().

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Mateusz Guzik 1 month ago

On Mon, Jun 8, 2026 at 5:02 PM Jann Horn <jannh@google.com> wrote:
>
> On Thu, May 28, 2026 at 2:55 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > This problem is dear to my heart and I have been pondering it on and off
> > for some time now. The entire fork + exec idiom is terrible and needs to
> > be retired.
>
> It seems to me like vfork+exec is a decent UAPI building block, on
> which you can build nice-looking userspace APIs, though I agree that
> this is not an ideal direct interface for application code.
>
> > Additionally there is a known problem where transiently copied file
> > descriptors on fork + exec cause a headache in multithreaded programs
> > doing something like this in parallel. I only did cursory reading, it
> > seems your patchset keeps the same problem in place.
>
> I think we almost have UAPI that would let you avoid this issue?
> You can use clone() with CLONE_FILES, then unshare the FD table with
> close_range(3, UINT_MAX, CLOSE_RANGE_UNSHARE). That is not currently
> implemented to be atomic with stuff that happens on other threads, but
> if we changed that, and it doesn't provide a good way to carry some
> FDs across, but it feels to me like this could be fixed with a variant
> of close_range() that removes O_CLOEXEC FDs except ones listed in an
> array.

Suppose you want to exec a binary with the following fd set:
0 is /dev/null
1 is fd 1023 in your process
2 is fd 1023 in your process

You have tons of other fds and you don't want any of them anywhere near this.

Clean interface from my standpoint would avoid any unnecessary
overhead and would allow you to clearly specify what do you want.

In this case whatever the interface it should provide the ability to
map 1023 to 1 and 2 in the child. With the current syscall set you get
refs taken on these on clone, then you have to manually dup2 these
which is separate syscalls with extra atomics on top. A fast & elegant
solution would allow you to tell the kernel directly where to install
the 2 files.

Also note in practical terms userspace likes to closefrom/close_range
anyway to get rid of unwanted fds which happen to not have the cloexec
bit which is yet another syscall to invoke on the way to exec. A
better interface would instantly avoid the problem by not copying the
unwanted fds if not asked. For viability for use as foundation to
build posix_spawn over it such copying would have to be supported of
course.

>
> > There are numerous impactful ways to speed up execs both in terms of
> > single-threaded cost and their multicore scalability, most of which
> > would be immediately usable by all programs without an opt-in. imo these
> > needs to be exhausted before something like a "template" can be
> > considered.
>
> (I think probably a large part of this would be stuff that happens in
> userspace, like dynamic linking.)

I have not investigated userspace, even putting specific APIs aside
the kernel has *a lot* of avoidable overhead.

>
> > Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_MM? The C function
> vfork() is kind of a terrible API because of its returns-twice
> behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> wrapped by libc in a way similar to clone() (with the child executing
> a separate handler function), or if it was used in the implementation
> of some higher-level process-spawning API, it would be a perfectly
> fine API?
>
> Or am I misunderstanding what you mean by "messing with mm"?
>

I was not aware of this functionality, let's assume it indeed works.
You still have the file issue described above.

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Florian Weimer 1 month, 1 week ago

* Jann Horn:

>> Per the above, the primary win would stem from *NOT* messing with mm.
>
> As you write below, I think we have that with CLONE_MM? The C function
> vfork() is kind of a terrible API because of its returns-twice
> behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> wrapped by libc in a way similar to clone() (with the child executing
> a separate handler function), or if it was used in the implementation
> of some higher-level process-spawning API, it would be a perfectly
> fine API?

No, there is still a problem with SIGTSTP handling because we cannot
atomically unmask the signal during execve.  We need to unblock SIGTSTP
before execve in the new process, but this means that it can get
suspended by SIGTSTP.  Consequently, the execve never happens and the
original process is stuck in vfork:

  posix_spawn: parent can get stuck in uninterruptible sleep if child
  receives SIGTSTP early enough
  <https://inbox.sourceware.org/libc-help/2921668c-773e-465d-9480-0abb6f979bf9@www.fastmail.com/>

More on the low-level side, it's difficult to make sure that execve gets
a consistent snapshot of the environ vector.  Both vfork and execve need
to be async-signal-safe.  Any locking or memory allocation (except for
the stack …) persists in the original process after vfork returns.  The
environ vector can be large, so making a copy on the stack is not ideal.
It's even harder for getenv/setenv/unsetenv implementations that use
locking instead of software transactional memory.

In general, I prefer the vfork+execve API over things like posix_spawn
because eventually, you have dependencies between the syslets, or need
control flow.  This introduces a lot of complexity.  Conceptually,
vfork+execve is much simpler, and in many ways quite safe (even mutexes
work as long as they do not need a correct TID).

Thanks,
Florian

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Jann Horn 1 month ago

On Tue, Jun 9, 2026 at 8:08 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Jann Horn:
>
> >> Per the above, the primary win would stem from *NOT* messing with mm.
> >
> > As you write below, I think we have that with CLONE_MM? The C function
> > vfork() is kind of a terrible API because of its returns-twice
> > behavior, but I think if process cloning with CLONE_VM|CLONE_VFORK was
> > wrapped by libc in a way similar to clone() (with the child executing
> > a separate handler function), or if it was used in the implementation
> > of some higher-level process-spawning API, it would be a perfectly
> > fine API?
>
> No, there is still a problem with SIGTSTP handling because we cannot
> atomically unmask the signal during execve.  We need to unblock SIGTSTP
> before execve in the new process, but this means that it can get
> suspended by SIGTSTP.  Consequently, the execve never happens and the
> original process is stuck in vfork:
>
>   posix_spawn: parent can get stuck in uninterruptible sleep if child
>   receives SIGTSTP early enough
>   <https://inbox.sourceware.org/libc-help/2921668c-773e-465d-9480-0abb6f979bf9@www.fastmail.com/>
>
> More on the low-level side, it's difficult to make sure that execve gets
> a consistent snapshot of the environ vector.  Both vfork and execve need
> to be async-signal-safe.  Any locking or memory allocation (except for
> the stack …) persists in the original process after vfork returns.  The

I think that's not entirely accurate; if you call set_robust_list() on
a futex list, then call execve(), the futexes should be released once
the process switches to a new MM, in
begin_new_exec -> exec_mmap -> exec_mm_release -> futex_exec_release
-> futex_cleanup -> exit_robust_list.

So in theory you could use clone() with CLONE_VM and without
CLONE_VFORK, and let the parent either wait for a futex that is
released on exec, or somehow asynchronously check later whether the
futex is still held... probably not the nicest building block but
maybe workable? Though I guess it would fit more nicely if there was a
"munmap() this range on exec" API...

> environ vector can be large, so making a copy on the stack is not ideal.
> It's even harder for getenv/setenv/unsetenv implementations that use
> locking instead of software transactional memory.

Makes sense, that kind of sounds like a pain inherent in being able to
execute from signal handler context...

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Li Chen 1 month, 2 weeks ago

Hi Mateusz,

 ---- On Thu, 28 May 2026 20:55:32 +0800  Mateusz Guzik <mjguzik@gmail.com> wrote --- 
 > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > This RFC adds spawn_template, a userspace-controlled exec acceleration
 > > mechanism for runtimes that repeatedly start the same executable with
 > > different argv, envp, and per-spawn file descriptor setup.
 > > 
 > > The main target is agent runtimes. Modern coding agents repeatedly start
 > > short-lived helper tools such as rg, git, sed, awk, python, node, and
 > > shell wrappers while they inspect and edit a workspace. Those runtimes
 > > already know which tools are hot, and they are also the right place to
 > > decide policy. The kernel does not choose names such as rg, git, or sed.
 > > Userspace opts in by creating a template fd for one executable, then uses
 > > that fd for later spawns. Launchers, shells, and build systems have a
 > > similar repeated-startup shape and could use the same primitive, but the
 > > agent runtime case is the main motivation for this RFC.
 > > 
 > [..]
 > > A typical agent runtime would keep one template per hot executable and
 > > still build argv, envp, cwd, and pipe wiring for each tool call:
 > > 
 > >     rg_tmpl = spawn_template_create("/usr/bin/rg");
 > > 
 > >     for each search request:
 > >         out_r, out_w = pipe_cloexec();
 > >         err_r, err_w = pipe_cloexec();
 > >         actions = [
 > >             FCHDIR(worktree_fd),
 > >             DUP2(out_w, STDOUT_FILENO),
 > >             DUP2(err_w, STDERR_FILENO),
 > >         ];
 > >         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
 > >         close(out_w);
 > >         close(err_w);
 > >         read out_r and err_r;
 > >         waitid(P_PIDFD, child.pidfd, ...);
 > > 
 > > 
 > [..]
 > > The cached state is intentionally small. The template fd keeps the opened
 > > main executable file, an optional absolute path string, the creator
 > > credential pointer, and the deny-write state. The executable identity key
 > > records device, inode, size, mode, owner, ctime, and mtime, and is
 > > rechecked before cached metadata is used. The ELF cache keeps only the
 > > main executable's ELF header, program header table, and program header
 > > count.
 > > 
 > >     cached in this RFC          not cached in this RFC
 > >     ------------------          ----------------------
 > >     opened main executable      PT_INTERP metadata
 > >     executable identity key     shared-library graph
 > >     main ELF header             VMA layout metadata
 > >     main ELF program headers    cross-process metadata sharing
 > >     creator cred pointer
 > >     deny-write state
 > > 
 > > This RFC does not cache ELF interpreter metadata, shared-library
 > > dependency state, or derived mapping-layout state. Shared-library
 > > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
 > > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
 > > state. It also does not share cached executable metadata between template
 > > fds created by different processes. Each template owns its small cached
 > > metadata object in this RFC.
 > > 
 > > Performance
 > > ===========
 > > 
 > [..]
 > > Workload     Calls  subprocess  spawn_template  time_s       Delta
 > > (workers)    calls  calls/s     calls/s         seconds
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > > 
 > 
 > This problem is dear to my heart and I have been pondering it on and off
 > for some time now. The entire fork + exec idiom is terrible and needs tox
 > be retired.
 > 
 > Is this vibe-coded? I asked claude for in-kernel posix_spawn for kicks
 > some time ago and it generated remarkably similar code. But that's a
 > tangent.

Partly, yes. The original idea came from using agents myself and noticing
that they spend a lot of time starting short-lived tools such as rg, sed,
git, bash, and python. I was wondering whether repeated tool calls could
be made cheaper.

After that I used an LLM to bounce around the smallest kernel prototype
for the idea. I did some review, patch split, test, benchmark, leak-check work,
and throw away some cache codes that not actually useful.

 > I'm rather confused by the angle in the patchset. Most of this shaves
 > off a tiny amount of work, while retaining the primary avoidable reason
 > for bad performance: the very fact that fork is part of the picture,
 > especially the part mucking with mm. Creating a pristine process is the
 > way to go.
 > 
 > Additionally there is a known problem where transiently copied file
 > descriptors on fork + exec cause a headache in multithreaded programs
 > doing something like this in parallel. I only did cursory reading, it
 > seems your patchset keeps the same problem in place.
 > 
 > There are numerous impactful ways to speed up execs both in terms of
 > single-threaded cost and their multicore scalability, most of which
 > would be immediately usable by all programs without an opt-in. imo these
 > needs to be exhausted before something like a "template" can be
 > considered.
 > 
 > Per the above, the primary win would stem from *NOT* messing with mm.
 > 
 > As in, whatever the interface, it needs to create an "empty" target
 > process (for lack of a better term).
 > 
 > In terms of userspace-visible APIs, a clean solution escapes me.
 > 
 > Some time ago I proposed returning a handle which is populated over time
 > by the parnet-to-be. One of the problems with it I failed to consider at
 > the time is NUMA locality -- what if the process to be created is going
 > to run on another domain? For example, opening and installing a file for
 > its later use will result in avoidable loss of locality for some of the
 > in-kernel data. That's on top of the fd vs fork problem.
 > 
 > From perf standpoint, the final goal of whatever mechanism should be a
 > state where the target process avoided copying any state it did not need
 > to and which allocated any memory it needed from local NUMA node
 > (whatever it may happen to be). Of course if no affinity is assigned it
 > may happen to move again and lose such locality, nothing can be done
 > about that. But pretend the process is to run in a specific node the
 > parent is NOT running in.
 > 
 > So I think the pragmatic way forward is to implement something close to
 > posix_spawn in the kernel. It may make sense for the thing to take the
 > PATH argument for repeated exec attempts. I understand this is of no use
 > in your particular case, but it very much IS of use for most of the
 > real-world. The initial implementation might even start with doing vfork
 > just to get it off the ground.
 > 
 > The next step would be to extend the interface with means to AVOID
 > copying any file descriptors. There could be a dedicated file action
 > which tells the kernel to avoid such copies or something like a
 > close_range file action (or close_from) -- with a range like <0, INT_MAX>
 > you know no fds are copied.
 > 
 > For the NUMA angle to be sorted out, any file action which opens a file
 > or dups from the parent needs to execute in the child. And frankly
 > something would be needed to ask the scheduler where does it think the
 > child is going to run, so that the task_struct itself can also be
 > allocated with the right backing.
 > 
 > I have not looked into what's needed to create a new process and NOT
 > mess with mm, but I don't think there are unsolvable problems there, at
 > worst some churn.
 > 
 > There are of course other parameters which need to be sorted out, that's
 > covered by the posix_spawn thing.
 > 
 > This e-mail is long enough, so I'm not going to go into issues
 > concerning exec itself right now.
 > 
 > tl;dr I would suggest redoing the patchset as posix_spawn and then doing
 > the actual optimization of not cloning mm itself.
 > 

Thanks a lot for writing this up. I clearly had too narrow a view of the
problem. I was mostly thinking about repeated executable startup, but your
reply and Christian's and Andy's made me see that the more useful target is probably
a pidfd/pidfs-backed process builder which can sit under posix_spawn, and
then grow into something that avoids the fork-shaped mm and fd costs. I
learned a lot from this thread.

At a high level, Windows CreateProcess/NtCreateUserProcess also looks
closer to this direction than fork+exec: create the target process
directly, pass explicit startup attributes and handle inheritance state,
and avoid starting from a copy of the parent address space. That seems
to be the same basic advantage here: build the child closer to its final
shape instead of copying parent state and then throwing much of it away.

I will study the process creation, exec, pidfd/pidfs, and posix_spawn
codes more carefully, then try the direction you suggested
and benchmark the mm/fd costs.

Regards,
Li

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Christian Brauner 1 month, 2 weeks ago

On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> Hi,
> 
> This is an early RFC for an idea that is probably still rough in both the
> UAPI and implementation details. Sorry for the rough edges; I am sending
> it now to check whether this direction is worth pursuing and to get
> feedback on the kernel/userspace boundary.

The idea of having a builder api for exec isn't all that crazy. But it
should simply be built on top of pidfds and thus pidfs itself instead.
It has all the basic infrastructure in place already. Any implementation
should also allow userspace to implement posix_spawn() on top of it.

fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)

pidfd_config(fd, ...) // modeled similar to fsconfig()

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Andy Lutomirski 1 month, 1 week ago

On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > Hi,
> >
> > This is an early RFC for an idea that is probably still rough in both the
> > UAPI and implementation details. Sorry for the rough edges; I am sending
> > it now to check whether this direction is worth pursuing and to get
> > feedback on the kernel/userspace boundary.
>
> The idea of having a builder api for exec isn't all that crazy. But it
> should simply be built on top of pidfds and thus pidfs itself instead.
> It has all the basic infrastructure in place already. Any implementation
> should also allow userspace to implement posix_spawn() on top of it.
>
> fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
>
> pidfd_config(fd, ...) // modeled similar to fsconfig()
>

After contemplating this for a bit... why pidfd?  Doesn't a pidfd
refer to an actual process that is, or at least was, running?  This
new thing is a process that we are contemplating spawning.  I can
imagine that basically all pidfd APIs would be a bit confused by the
nonexistence of the process in question.

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Christian Brauner 1 month ago

On Mon, Jun 08, 2026 at 05:01:57PM -0700, Andy Lutomirski wrote:
> On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > > Hi,
> > >
> > > This is an early RFC for an idea that is probably still rough in both the
> > > UAPI and implementation details. Sorry for the rough edges; I am sending
> > > it now to check whether this direction is worth pursuing and to get
> > > feedback on the kernel/userspace boundary.
> >
> > The idea of having a builder api for exec isn't all that crazy. But it
> > should simply be built on top of pidfds and thus pidfs itself instead.
> > It has all the basic infrastructure in place already. Any implementation
> > should also allow userspace to implement posix_spawn() on top of it.
> >
> > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> >
> > pidfd_config(fd, ...) // modeled similar to fsconfig()
> >
> 
> After contemplating this for a bit... why pidfd?  Doesn't a pidfd
> refer to an actual process that is, or at least was, running?  This
> new thing is a process that we are contemplating spawning.  I can
> imagine that basically all pidfd APIs would be a bit confused by the
> nonexistence of the process in question.

I don't think that would be a problem because every api just needs to
handle ESRCH. Ignoring that for a second: the mount api has a builder fd
that is later transformed into a pidfd. Which is easily doable here as
well. My point is that all the infrastructure building blocks already
exist in pidfs.

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Li Chen 1 month, 1 week ago

Hi Andy,

 ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 4:05 AM Christian Brauner <brauner@kernel.org> wrote:
 > >
 > > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > > Hi,
 > > >
 > > > This is an early RFC for an idea that is probably still rough in both the
 > > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > > it now to check whether this direction is worth pursuing and to get
 > > > feedback on the kernel/userspace boundary.
 > >
 > > The idea of having a builder api for exec isn't all that crazy. But it
 > > should simply be built on top of pidfds and thus pidfs itself instead.
 > > It has all the basic infrastructure in place already. Any implementation
 > > should also allow userspace to implement posix_spawn() on top of it.
 > >
 > > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
 > >
 > > pidfd_config(fd, ...) // modeled similar to fsconfig()
 > >
 > 
 > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
 > refer to an actual process that is, or at least was, running?  This
 > new thing is a process that we are contemplating spawning.  I can
 > imagine that basically all pidfd APIs would be a bit confused by the
 > nonexistence of the process in question.
 > 

Yes, I think that is a real concern.                                                                                                                                                               

In my current local WIP I tried to keep that distinction explicit.                                     
pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
referring to a process. The builder fd is allocated as an anonymous pidfs                                                                                                                                        
file with builder-specific file operations:       

    file = pidfs_alloc_anon_file("[pidfd_spawn]",                                                      
                                 &pidfd_spawn_builder_fops, builder,      
                                 O_RDWR);                                                              

and the normal pidfd helpers still reject it because it does not use the
ordinary pidfd file operations:                                                                        

    struct pid *pidfd_pid(const struct file *file)
    {
        if (file->f_op != &pidfs_file_operations)                                                      
            return ERR_PTR(-EBADF);               
        return file_inode(file)->i_private;                                                                                                                                                                      
    }                                                                                                                                                                                                            

So the current split is:                                                                               

    builder_fd = pidfd_spawn_open(...);       /* builder object */
    pidfd_config(builder_fd, ...);     
    child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */

Only the last fd is a normal pidfd for an actual child process. The
builder fd is only accepted by the builder operations.                                                                                                                                                           

This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
pidfd_getfd(), poll(), etc. mean before the process exists. The downside                                                                                                                                         
is that it adds a separate open-style entry point and is less uniform than                                                                                                                                       
the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.                                                                                                                                                      

If people think there is a better way to represent the pre-spawn builder
state, or if the preference is to integrate it directly into pidfd_open()
with an explicit empty/future-pidfd state, I would be happy to discuss
that.

Regards,
Li

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by John Ericson 1 month, 1 week ago

On Tue, Jun 9, 2026, at 10:43 AM, Li Chen wrote:
> Hi Andy,
>
> ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote ---
> > [...]
> >
> > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
> > refer to an actual process that is, or at least was, running?  This
> > new thing is a process that we are contemplating spawning.  I can
> > imagine that basically all pidfd APIs would be a bit confused by the
> > nonexistence of the process in question.
> >
>
> Yes, I think that is a real concern.
>
> In my current local WIP I tried to keep that distinction explicit.
> pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
> referring to a process. The builder fd is allocated as an anonymous pidfs
> file with builder-specific file operations:
>
>     file = pidfs_alloc_anon_file("[pidfd_spawn]",
>                                  &pidfd_spawn_builder_fops, builder,
>                                  O_RDWR);
>

What does your builder fd point to, explicitly? For example in my other reply I
talked about how it was "real" process state. In my FreeBSD patch, for example,
I found there was already a status for a process "in exec", and I figured that
was clean to reuse for one of these "embryonic" processes that also hadn't
started running. I would reckon that Linux probably has some similar notions.

> and the normal pidfd helpers still reject it because it does not use the
> ordinary pidfd file operations:
>
>     struct pid *pidfd_pid(const struct file *file)
>     {
>         if (file->f_op != &pidfs_file_operations)
>             return ERR_PTR(-EBADF);
>         return file_inode(file)->i_private;
>     }
>
> So the current split is:
>
>     builder_fd = pidfd_spawn_open(...);       /* builder object */
>     pidfd_config(builder_fd, ...);
>     child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */
>
> Only the last fd is a normal pidfd for an actual child process. The builder
> fd is only accepted by the builder operations.
>
> This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
> pidfd_getfd(), poll(), etc. mean before the process exists.

I wouldn't be so sure this is necessary/good. For example, I think it could
make sense to wait on a process that has yet to be started; one just waits for
both the process to start and the process to exit. Obviously a blocking syscall
in the thread that is spawning the process is not useful, but the asynchronous
poll variation seems fine.

As long as there is real process state here, it shouldn't be too hard to
implement.

> The downside is that it adds a separate open-style entry point and is less
> uniform than the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.

I do think there is no point having two file descriptors. The file descriptor
that previously referred to the builder/embryonic process then can refer to the
real process, right?

> If people think there is a better way to represent the pre-spawn builder
> state, or if the preference is to integrate it directly into pidfd_open()
> with an explicit empty/future-pidfd state, I would be happy to discuss that.

Hope the above answers your question? I suppose my ideas lean more on the
"future" than "empty" side --- there is indeed a thread in the thread group,
with real VM/namespace/file descriptor etc. state. Moreover, state gets
initialized before the process is started, so the actual start is a pretty
lightweight step of just letting the scheduler know the now-ready process can
be scheduled. The only thing that distinguishes the embryonic process from a
real one is simply that it isn't running --- i.e. isn't (yet) available to be
scheduled --- so the pidfds holders are free to poke at its state.

Cheers,

John

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Li Chen 1 month ago

Hi John,

 ---- On Wed, 10 Jun 2026 01:27:47 +0800  John Ericson <mail@johnericson.me> wrote --- 
 > 
 > 
 > On Tue, Jun 9, 2026, at 10:43 AM, Li Chen wrote:
 > > Hi Andy,
 > >
 > > ---- On Tue, 09 Jun 2026 08:01:57 +0800  Andy Lutomirski <luto@kernel.org> wrote ---
 > > > [...]
 > > >
 > > > After contemplating this for a bit... why pidfd?  Doesn't a pidfd
 > > > refer to an actual process that is, or at least was, running?  This
 > > > new thing is a process that we are contemplating spawning.  I can
 > > > imagine that basically all pidfd APIs would be a bit confused by the
 > > > nonexistence of the process in question.
 > > >
 > >
 > > Yes, I think that is a real concern.
 > >
 > > In my current local WIP I tried to keep that distinction explicit.
 > > pidfd_spawn_open() returns a pidfs-backed builder fd, not a normal pidfd
 > > referring to a process. The builder fd is allocated as an anonymous pidfs
 > > file with builder-specific file operations:
 > >
 > >     file = pidfs_alloc_anon_file("[pidfd_spawn]",
 > >                                  &pidfd_spawn_builder_fops, builder,
 > >                                  O_RDWR);
 > >
 > 
 > What does your builder fd point to, explicitly? For example in my other reply I
 > talked about how it was "real" process state. In my FreeBSD patch, for example,
 > I found there was already a status for a process "in exec", and I figured that
 > was clean to reuse for one of these "embryonic" processes that also hadn't
 > started running. I would reckon that Linux probably has some similar notions.
 > 
 > > and the normal pidfd helpers still reject it because it does not use the
 > > ordinary pidfd file operations:
 > >
 > >     struct pid *pidfd_pid(const struct file *file)
 > >     {
 > >         if (file->f_op != &pidfs_file_operations)
 > >             return ERR_PTR(-EBADF);
 > >         return file_inode(file)->i_private;
 > >     }
 > >
 > > So the current split is:
 > >
 > >     builder_fd = pidfd_spawn_open(...);       /* builder object */
 > >     pidfd_config(builder_fd, ...);
 > >     child_pidfd = pidfd_spawn_run(builder_fd, ...); /* real pidfd */
 > >
 > > Only the last fd is a normal pidfd for an actual child process. The builder
 > > fd is only accepted by the builder operations.
 > >
 > > This avoids having to define what waitid(P_PIDFD), pidfd_send_signal(),
 > > pidfd_getfd(), poll(), etc. mean before the process exists.
 > 
 > I wouldn't be so sure this is necessary/good. For example, I think it could
 > make sense to wait on a process that has yet to be started; one just waits for
 > both the process to start and the process to exit. Obviously a blocking syscall
 > in the thread that is spawning the process is not useful, but the asynchronous
 > poll variation seems fine.
 > 
 > As long as there is real process state here, it shouldn't be too hard to
 > implement.
 > 
 > > The downside is that it adds a separate open-style entry point and is less
 > > uniform than the pidfd_open(0, PIDFD_EMPTY) spelling Christian sketched.
 > 
 > I do think there is no point having two file descriptors. The file descriptor
 > that previously referred to the builder/embryonic process then can refer to the
 > real process, right?
 > 
 > > If people think there is a better way to represent the pre-spawn builder
 > > state, or if the preference is to integrate it directly into pidfd_open()
 > > with an explicit empty/future-pidfd state, I would be happy to discuss that.
 > 
 > Hope the above answers your question? I suppose my ideas lean more on the
 > "future" than "empty" side --- there is indeed a thread in the thread group,
 > with real VM/namespace/file descriptor etc. state. Moreover, state gets
 > initialized before the process is started, so the actual start is a pretty
 > lightweight step of just letting the scheduler know the now-ready process can
 > be scheduled. The only thing that distinguishes the embryonic process from a
 > real one is simply that it isn't running --- i.e. isn't (yet) available to be
 > scheduled --- so the pidfds holders are free to poke at its state.
 > 
 > Cheers,
 > 
 > John
 > 

Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
closer with P_LINTRANSIT, described as "process in exec or in creation".
Linux does not seem to have a single equivalent today: current->in_execve
is only an LSM hint, while the real synchronization is spread across
exec_update_lock, cred_guard_mutex, and the exec path.

I am switching my local WIP from the two-fd builder model to one fd,
closer to Christian's sketch:

fd = pidfd_open(0, PIDFD_EMPTY);
pidfd_config(fd, ...);
pidfd_spawn_run(fd, ...);

In my current local version, I still use copy_process(), so the fd points
at a real task_struct/pid that is not woken until run. Following
Christian's point that existing APIs can handle this not-yet-running case
with ESRCH, I currently make ordinary pidfd operations that need a real
started process return -ESRCH before start.

I am not sure yet whether Linux should grow a general exec/creation
transition state like that, or whether a narrower future-process
lifecycle is enough for this API. I will think more about that when
working on the pristine process version.

Regards,
Li

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Mateusz Guzik 1 month ago

On Wed, Jun 10, 2026 at 08:29:06PM +0800, Li Chen wrote:
>  ---- On Wed, 10 Jun 2026 01:27:47 +0800  John Ericson <mail@johnericson.me> wrote --- 
>  > Hope the above answers your question? I suppose my ideas lean more on the
>  > "future" than "empty" side --- there is indeed a thread in the thread group,
>  > with real VM/namespace/file descriptor etc. state. Moreover, state gets
>  > initialized before the process is started, so the actual start is a pretty
>  > lightweight step of just letting the scheduler know the now-ready process can
>  > be scheduled. The only thing that distinguishes the embryonic process from a
>  > real one is simply that it isn't running --- i.e. isn't (yet) available to be
>  > scheduled --- so the pidfds holders are free to poke at its state.
>  > 
> 
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.
> 
> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
> 
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);
> 
> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run. Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.
> 
> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.
> 

As I tried to explain in my previous e-mail this approach does not cut
it because of NUMA.

Suppose you have a machine with 2 nodes. The parent-to-be is running
on node 0 and the child is intended to exec something on node 1.

When the parent-to-be allocates and populates stuff, it takes place with
memory backed by node 0. If you allocate task_struct, the file table and
other frequently used (and modified!) objs in this way, you are
guaranteeing performance loss due to interconnect traffic to access it.

Trying to add plumbing so that all allocations respect numa placement is
probably too cumbersome.

The primary example for that is looking up the binary to exec in the
first place.

userspace likes to pass paths which don't exist, meaning checking for
the binary before any hard work is a useful optimizaiton. Suppose the
binary to be executed is in a container bound with a taskset using
node 1 and the content of the fs part of the container is currently
fully uncached.

When you perform the lookup on node 0, you are populating a bunch of
metadata (inode, dentry) using memory from that domain. But the intended
user will only execute on node 1, again resulting in a performance loss.

In order to not do it you would need to convince VFS to allocate memory
elsewhere.

So I stand by my previous claim that ultimately a pristine child has to
be created (like in this patch), but which also has to do the work on
its own.

Suppose there is no explicit placement requested anywhere. Even in that
case there are legitimate workloads which will eventually be forced to
exec stuff on another node. Even these have a better chance retaining
full locality if the child process does all the work.

Per my previous message I don't see a clean interface to do it.
something quasi-posix_spawn is probably the least bad way out, it will
also allow userspace to easily wrap the new thing with posix_spawn
itself.

Also note there is another issue with the fd-based approach: the fd will
get inherited on fork and will hang out in the child afterwards unless
explicitly closed. Suppose you have a multithreaded program which likes
to both fork(+no exec) and fork+exec. With the fd-based approach you
have no means of stopping another thread from grabbing your state thanks
to unix defaulting to copying everything. There was an attempt to fix
this aspect with O_CLOFORK, but this got rejected.

Whatever exactly happens, NUMA is a sad fact of computing and needs to
be accounted for. The approach as proposed not only does not do it, but
it actively hinders such deployments.

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by John Ericson 1 month ago

On Wed, Jun 10, 2026, at 8:29 AM, Li Chen wrote:
> Hi John,
>
> [...]
>
> Thanks, this helped a lot. I looked at FreeBSD/OpenBSD/XNU after your
> note. FreeBSD has P_INEXEC, OpenBSD has PS_INEXEC, and XNU seems even
> closer with P_LINTRANSIT, described as "process in exec or in creation".
> Linux does not seem to have a single equivalent today: current->in_execve
> is only an LSM hint, while the real synchronization is spread across
> exec_update_lock, cred_guard_mutex, and the exec path.

Great! Glad to hear my suggestion (and the patch too I linked in the
other email, I hope?) was useful.

> I am switching my local WIP from the two-fd builder model to one fd,
> closer to Christian's sketch:
>
> fd = pidfd_open(0, PIDFD_EMPTY);
> pidfd_config(fd, ...);
> pidfd_spawn_run(fd, ...);

Glad to hear it is also one-fd now.

> In my current local version, I still use copy_process(), so the fd points
> at a real task_struct/pid that is not woken until run.

So this is an interesting thing to think about. My hunch is that
`copy_process` is, at least in the longer term, still doing too much! In
particular, `struct kernel_clone_args` has many degrees of freedom, and
might also make assumptions about preserving more of the parent process
than is needed in this case.

This is a bit tangential, but one thing I have thought about is having
"null namespaces". I think the current (i.e. existing clone API) default
of "share with parent process" is a poor security practice (more
privileges, i.e. sharing, should always be opt-in). But the opposite
default of "unshare everything" is expensive since creating new
namespaces is non-free. The goal of the null namespaces would be a cheap
way of creating a more isolated and unprivileged process — and "cheap"
here is literal: a null pointer in `nsproxy`, no allocation, no
namespace object, no ID. This null state would be what
`pidfd_open(0, PIDFD_EMPTY)` (using your example above, or really
whatever the first step is) hands back.

Then, from that maximally cheap and unprivileged initial state, the
`pidfd_config(fd, ...);` calls (plural important, I think!) would opt
into either sharing or unsharing namespaces between the child and parent
as the parent sees fit.

The larger point here is that insofar as there are not good defaults for
things, there is pressure, whether in step 1 or step 2, to make larger
everything-at-once configuration. But when we think a bit outside the
box to create the good defaults where they didn't previously exist, we
can end up in a situation where a minimal initial blank unstarted
process, and the builder pattern to initialize it, are more "natural".

> Following
> Christian's point that existing APIs can handle this not-yet-running case
> with ESRCH, I currently make ordinary pidfd operations that need a real
> started process return -ESRCH before start.

Also glad to hear.

> I am not sure yet whether Linux should grow a general exec/creation
> transition state like that, or whether a narrower future-process
> lifecycle is enough for this API. I will think more about that when
> working on the pristine process version.

Sounds good, as I think you can guess, my preference is for "yes", but I
agree we can see what you end up with in the next patchset and make more
informed decisions based on that.

Cheers,

John

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Kees Cook 1 month, 2 weeks ago

On Thu, May 28, 2026 at 01:02:53PM +0200, Christian Brauner wrote:
> On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > Hi,
> > 
> > This is an early RFC for an idea that is probably still rough in both the
> > UAPI and implementation details. Sorry for the rough edges; I am sending
> > it now to check whether this direction is worth pursuing and to get
> > feedback on the kernel/userspace boundary.
> 
> The idea of having a builder api for exec isn't all that crazy. But it
> should simply be built on top of pidfds and thus pidfs itself instead.
> It has all the basic infrastructure in place already. Any implementation
> should also allow userspace to implement posix_spawn() on top of it.
> 
> fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> 
> pidfd_config(fd, ...) // modeled similar to fsconfig()

FWIW, I agree this should be modelled after fsconfig and built on pidfs.
Doing so will avoid a bunch of design issues, etc.

-Kees

-- 
Kees Cook

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by Li Chen 1 month, 2 weeks ago

Hi Christian,

Thanks a lot for your great review!

 ---- On Thu, 28 May 2026 19:02:53 +0800  Christian Brauner <brauner@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > Hi,
 > > 
 > > This is an early RFC for an idea that is probably still rough in both the
 > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > it now to check whether this direction is worth pursuing and to get
 > > feedback on the kernel/userspace boundary.
 > 
 > The idea of having a builder api for exec isn't all that crazy. But it
 > should simply be built on top of pidfds and thus pidfs itself instead.
 > It has all the basic infrastructure in place already.

Yes, that makes a lot more sense. I was staring too hard at the "hot
executable" part and made the cache/template the API, which was probably
the wrong thing to expose. Sorry about that.

 > Any implementation
 > should also allow userspace to implement posix_spawn() on top of it.

That's so cool, and this is a really useful point. I had not thought about this as
something that could sit under posix_spawn(), but that makes the target
much clearer. It should be a generic exec/spawn builder first, and the
agent use case should just be one user of it.

 > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
 > 
 > pidfd_config(fd, ...) // modeled similar to fsconfig()

Reusing pidfd_open() with an empty target is nice because it keeps the API close
to pidfds, but I wonder if a separate entry point such as
pidfd_spawn_open() or pidfd_create() would make the "new process
builder" case a bit more explicit? Either way, the configuration side
being fsconfig-like makes sense to me.

Thanks again for pointing me in this direction. It helps a lot.

Regards,
Li

Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup

Posted by John Ericson 1 month, 1 week ago

Hi all,

I am happy to see this thread appear. I emailed Christian and others ~5 years
ago about this in this thread[1]; it would be great to see it finally happen!

I very much agree that the new process spawning should be pidfd based. I also
want to emphasize that the crux of the matter is that code needed to set up the
initial unscheduled process --- which I do think should be "real state" and
more than a mere template --- is currently chopped up between clone and exec.
So the real meat of the implementation would be factoring out a bunch of stuff
so it can be reused in both the legacy clone+exec and modern code paths.

I'll say a bit more about this "real state" vs "mere template" distinction,
which is that the latter is effectively some sort of ad-hoc operation batching
language, and always runs the risk of falling behind what the kernel actually
supports. The "real state" approach, where we have honest-to-goodness process
state, just in some partially initialized fashion and thus it's not yet
scheduled, always supports everything the kernel supports in principle.

Yes, alternative syscalls that specify which "embryonic" process (as opposed to
always the current active process) need to be created, but that is less bad
than trying to stuff things into flags etc. for a single existing system call,
and also one can imagine a world (as described in
https://catern.com/rsys21.pdf) where the exact "which process?" parameter
starts getting added to new process modifying machinery by *default*, with a
sentinel value analogous to `AT_FDCWD` used to mean "the current process" for
the legacy used-between-fork-and-exec usecase.

---

Anyways, years ago, after taking a glance at the relevant code in Linux and
FreeBSD, I figured that it would be easier for me personally to first implement
this functionality in FreeBSD, and then, once I had a feel for some of the
refactoring, take a stab at it in Linux. This is because Linux's feature set,
especially things like `binfmt_misc`, makes its clone and exec quite a bit more
complex, and thus the (IMO) necessary heavy refactoring quite a bit more
extensive too.

I never got around to it in the 5 years, but these days, with LLMs, doing an
"exploratory refactor" (to get a sketch of a patch that is fodder for discussion
not yet fit for actual submission) is much easier. So inspired by this thread, I
took a few hours to do the exploratory FreeBSD refactor in [2]. The man page for
the new syscalls, [3], might be a good place to start reading. (This, being from
a FreeBSD patch, describes the change in terms of "proc fds", but the switch to
Linux's "pidfds" should be self-explanatory. The former after all inspired the
latter.)

Hope discussion of such a patch isn't too off topic here, but there is an
interesting thing to note that would also apply to a Linux implementation. It
took *more* factored out helper functions than I thought. The current count is
over 15(!) --- there didn't seem to be a way to build both the old and new way
of doing things with fewer, coarser building blocks. Now, granted, maybe
someone more familiar with either kernel than me could do a better job, but I
think it will still be a number of functions. This indicates just how much
untangling there is to do. And the number will surely be much higher for Linux.

[1]: https://lore.kernel.org/all/f8457e20-c3cc-6e56-96a4-3090d7da0cb6@JohnEricson.me/

[2]: https://github.com/obsidiansystems/freebsd-src/commit/better-proc-spawn
     239dcdefe6ad244e58d998155b527375e5293ff7 for posterity

[3]: https://raw.githubusercontent.com/obsidiansystems/freebsd-src/refs/heads/better-proc-spawn/lib/libsys/proc_new.2

On Sun, May 31, 2026, at 10:47 PM, Li Chen wrote:
> Hi Christian,
>
> Thanks a lot for your great review!
>
> ---- On Thu, 28 May 2026 19:02:53 +0800  Christian Brauner <brauner@kernel.org> wrote ---
> > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
> > > Hi,
> > >
> > > This is an early RFC for an idea that is probably still rough in both the
> > > UAPI and implementation details. Sorry for the rough edges; I am sending
> > > it now to check whether this direction is worth pursuing and to get
> > > feedback on the kernel/userspace boundary.
> >
> > The idea of having a builder api for exec isn't all that crazy. But it
> > should simply be built on top of pidfds and thus pidfs itself instead.
> > It has all the basic infrastructure in place already.
>
> Yes, that makes a lot more sense. I was staring too hard at the "hot
> executable" part and made the cache/template the API, which was probably
> the wrong thing to expose. Sorry about that.
>
> > Any implementation
> > should also allow userspace to implement posix_spawn() on top of it.
>
> That's so cool, and this is a really useful point. I had not thought about this as
> something that could sit under posix_spawn(), but that makes the target
> much clearer. It should be a generic exec/spawn builder first, and the
> agent use case should just be one user of it.
>
> > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
> >
> > pidfd_config(fd, ...) // modeled similar to fsconfig()
>
> Reusing pidfd_open() with an empty target is nice because it keeps the API close
> to pidfds, but I wonder if a separate entry point such as
> pidfd_spawn_open() or pidfd_create() would make the "new process
> builder" case a bit more explicit? Either way, the configuration side
> being fsconfig-like makes sense to me.

Yeah check out my syscalls [3] on that front. It's important to design the
workflow / state machine in a good way. Performance/efficiency, security (share
less state/privileges by default!), and extensibility (where will newer
concepts, like a new type of namespace, fit in?) are all competing concerns,
but I think they mostly pull in the same direction. (Only no ambient authority,
back compat, and extensibility exist in some tension.)

> Thanks again for pointing me in this direction. It helps a lot.
>
> Regards,
> Li

Glad you are sold on pidfds, and more broadly, best of luck! You'll be a hero
to everyone else that has wanted this over the years :)

John