Kees, Andy, et al, please comment. I think the usage of syscall_rollback()
in __seccomp_filter() is not right.
This is just RFC.
In fact I think that syscall_exit_work() should do nothing if a
syscall was rejected with force_sig_seccomp() by __seccomp_filter().
If nothing else, the syscall was never actually executed.
Perhaps we can add a new SYSCALL_WORK_SYSCALL_XXX to SYSCALL_WORK_EXIT.
seccomp_nack_syscall() can set this flag, and syscall_exit_work() can do
if (work & SYSCALL_WORK_SYSCALL_XXX) {
clear_syscall_work(SYSCALL_XXX); // for the !force_coredump case
return;
}
after the "if (SYSCALL_WORK_SYSCALL_USER_DISPATCH)" block.
But I didn't dare to do such a change.
What do you think?
Oleg.
On 04/14, Oleg Nesterov wrote:
>
> Kees, Andy, et al, please comment. I think the usage of syscall_rollback()
> in __seccomp_filter() is not right.
I'll recheck, but in fact this logic looks broken... force_sig_seccomp() assumes
that it can't race with (say) SIGSEGV which has a handler. And 2/2 makes the things
slightly worse. So self-nack for now.
> In fact I think that syscall_exit_work() should do nothing if a
> syscall was rejected with force_sig_seccomp() by __seccomp_filter().
> If nothing else, the syscall was never actually executed.
>
> Perhaps we can add a new SYSCALL_WORK_SYSCALL_XXX to SYSCALL_WORK_EXIT.
> seccomp_nack_syscall() can set this flag, and syscall_exit_work() can do
>
> if (work & SYSCALL_WORK_SYSCALL_XXX) {
> clear_syscall_work(SYSCALL_XXX); // for the !force_coredump case
> return;
> }
>
> after the "if (SYSCALL_WORK_SYSCALL_USER_DISPATCH)" block.
>
> But I didn't dare to do such a change.
>
> What do you think?
I'll try to send a patch based on above this week.
Oleg.
On Wed, Apr 15, 2026 at 12:44:25PM +0200, Oleg Nesterov wrote:
> On 04/14, Oleg Nesterov wrote:
> >
> > Kees, Andy, et al, please comment. I think the usage of syscall_rollback()
> > in __seccomp_filter() is not right.
>
> I'll recheck, but in fact this logic looks broken... force_sig_seccomp() assumes
> that it can't race with (say) SIGSEGV which has a handler. And 2/2 makes the things
> slightly worse. So self-nack for now.
I've spent some more time looking at all this. It does seem to me that
dropping syscall_exit_work() entirely for killed syscalls is the right way
to go for fixing the audit/trace/ptrace confusion on the exit side. But
I don't think it closes the whole problem. Apologies for any verbosity
here, I'm kind of taking notes for myself too. :)
Once the spurious exit-stop is gone, the sequence for RET_KILL becomes:
- entry-stop for syscall (tracer sees "entry")
- tracer PTRACE_SYSCALLs
- seccomp RET_KILLs the syscall; no exit-stop
- get_signal() dequeues SIGSYS
- SA_IMMUTABLE means ptrace_signal() skipped and no signal-delivery stop
- task dies (and maybe coredumps)
- tracer's next waitpid() returns WIFSIGNALED, WTERMSIG==SIGSYS
(and maybe WCOREDUMP==1)
This view is technically correct (entry then death), but the
tracer has no visibility into what happened as the whole siginfo that
force_sig_seccomp() assembled never reaches the tracer due to SA_IMMUTABLE
being set, since get_signal() short-circuits ptrace_signal() entirely.
RET_TRAP doesn't have this problem (force_coredump=false, no SA_IMMUTABLE,
the tracer sees a normal signal-delivery stop for SIGSYS). So the
asymmetry is specifically the RET_KILL path, AIUI.
I was trying to consider whether fixing this with a new ptrace event
(PTRACE_EVENT_SECCOMP_KILL or a new PTRACE_SYSCALL_INFO op) would be
better than reusing the existing signal-delivery stop (but perhaps in a
"read-only" mode). My sense is that a new event isn't worth it, because
the mutation surface a tracer can reach would be nearly the same:
Tracer action | signal-stop for SIGSYS | new event
---------------+-------------------------------+--------------------------
CONT sig=0 | Direct mutation, must reject. | N/A (SIGSYS still queued)
CONT sig=X | Direct mutation, must reject. | Injects X racing SIGSYS,
| | must reject.
SETSIGINFO | Mutates last_siginfo, | last_siginfo unset,
| must reject or restore. | mostly no-op?
SETREGS | Corrupts coredump view, | Same.
| should reject or restore. |
POKEDATA | Info only, doesn't matter. | Same.
The only thing the new event gets for free is "can't suppress/replace
the stopping signal," which is a single check to enforce in the sig-stop
approach (ignore exit_code on resume). Register and siginfo mutation is
identical in both. On the other hand, the sig-stop approach doesn't change
ABI and existing tracers already handle SYS_SECCOMP siginfo because they
see it on RET_TRAP today.
So I'm thinking the full fix is to change what SA_IMMUTABLE actually
means: instead of "ptrace is disabled", it can be "the signal cannot
be changed (i.e. cannot stop the kill)". Which means in get_signal()
at the SA_IMMUTABLE check, stop gating ptrace_signal() on the flag and
instead pass the flag into ptrace_signal() (or check in other places) so
it can run in a "read-only" mode. I think refusing tracer actions would
be best, but perhaps just snapshot all the things we don't want changed?
For example:
- Snapshot ksig->info and the relevant pt_regs before ptrace_stop().
- After resume, if the immutable flag was set:
- ignore current->exit_code; keep the original signr (no
suppression, no replacement);
- restore ksig->info from the snapshot (SETSIGINFO is ignored);
- restore pt_regs from the snapshot so the coredump still sees
the original syscall attempt that syscall_rollback() set up.
- Leave POKEDATA alone: it's not a security concern, AFAICT.
This preserves what SA_IMMUTABLE was actually meant to guarantee (the
tracee dies, from SIGSYS, with the coredump reflecting the attempted
syscall) while giving the tracer the observation point they need. rr,
strace, and gdb all already know how to read SYS_SECCOMP siginfo from
a SIGSYS stop, so there's nothing to teach them. However, they may not
be expecting the stop, which is the only part we'd need to double check.
So, tl;dr:
- syscall_exit_work() skips the exit tracehook, audit, and trace
when the syscall was RET_KILLed.
- SA_IMMUTABLE stops disabling ptrace_signal() and starts gating
mutations within it.
What do you think?
-Kees
--
Kees Cook
On 04/15, Kees Cook wrote: > > I've spent some more time looking at all this. It does seem to me that > dropping syscall_exit_work() entirely for killed syscalls is the right way > to go for fixing the audit/trace/ptrace confusion on the exit side. OK, great. I'll try to make a patch as soon I have time. Hopefully this week. > But > I don't think it closes the whole problem. I guess we can discuss this in more detail later/separately? > Apologies for any verbosity > here, I'm kind of taking notes for myself too. :) Thanks for the detailed email ;) I will snip some parts for now... > I was trying to consider whether fixing this with a new ptrace event > (PTRACE_EVENT_SECCOMP_KILL or a new PTRACE_SYSCALL_INFO op) would be > better than reusing the existing signal-delivery stop (but perhaps in a > "read-only" mode). My sense is that a new event isn't worth it, Agreed, > So I'm thinking the full fix is to change what SA_IMMUTABLE actually > means: instead of "ptrace is disabled", it can be "the signal cannot > be changed (i.e. cannot stop the kill)". Which means in get_signal() > at the SA_IMMUTABLE check, stop gating ptrace_signal() on the flag and > instead pass the flag into ptrace_signal() (or check in other places) so > it can run in a "read-only" mode. OK, we can add something like PT_FREEZED which leaves in task->ptrace, but see below. > I think refusing tracer actions would be best, agreed > - syscall_exit_work() skips the exit tracehook, audit, and trace > when the syscall was RET_KILLed. Good ;) > - SA_IMMUTABLE stops disabling ptrace_signal() and starts gating > mutations within it. Honestly, I am not sure this is really useful... But I do not know. And again, we can discuss this later I hope. ---------------------------------------------------------------------- Now a stupid question ;) Why does __seccomp_filter() use syscall_rollback() anyway? OK, may be ax == orig_ax makes sense for coredump, I dunno. But case SECCOMP_RET_TRAP: /* Show the handler the original registers. */ syscall_rollback(current, current_pt_regs()); /* Let the filter pass back 16 bits of data. */ force_sig_seccomp(this_syscall, data, false); the handler can just use info.si_syscall instead of sigcontext.rax ? Oleg.
On Wed, Apr 15, 2026 at 12:44:25PM +0200, Oleg Nesterov wrote: > On 04/14, Oleg Nesterov wrote: > > > > Kees, Andy, et al, please comment. I think the usage of syscall_rollback() > > in __seccomp_filter() is not right. > > I'll recheck, but in fact this logic looks broken... force_sig_seccomp() assumes > that it can't race with (say) SIGSEGV which has a handler. And 2/2 makes the things > slightly worse. So self-nack for now. Oh, I just read this now. Yeah, that's a good point. Hrmpf. A corner case, but yeah, the proposed change makes things worse. -- Kees Cook
© 2016 - 2026 Red Hat, Inc.