__seccomp_filter() does
case SECCOMP_RET_KILL_THREAD:
case SECCOMP_RET_KILL_PROCESS:
...
/* Show the original registers in the dump. */
syscall_rollback(current, current_pt_regs());
/* Trigger a coredump with SIGSYS */
force_sig_seccomp(this_syscall, data, true);
syscall_rollback() does regs->ax == orig_ax. This means that
ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
it looks as if the aborted syscall actually succeeded and returned its
own syscall number.
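
A minimal user-space sketch of the confusion described above. It is a model, not kernel code: `struct model_regs`, `model_is_error()` and `model_rollback()` are hypothetical names, and the error test only imitates the usual "top 4095 values are errnos" convention that the exit path is assumed to apply to the return register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest errno encoded in a return register (mirrors the kernel's MAX_ERRNO). */
#define MODEL_MAX_ERRNO 4095

/* Hypothetical, simplified stand-in for the two x86-64 registers involved. */
struct model_regs {
	int64_t ax;      /* return value slot */
	int64_t orig_ax; /* syscall number saved at entry */
};

/* Assumed error test: values in [-4095, -1] count as errors. */
static bool model_is_error(int64_t rval)
{
	return (uint64_t)rval >= (uint64_t)-MODEL_MAX_ERRNO;
}

/* What the rollback amounts to in this model: ax is restored from orig_ax. */
static void model_rollback(struct model_regs *regs)
{
	assert(regs);
	regs->ax = regs->orig_ax;
}
```

With `orig_ax == 39` and an error value in `ax`, calling `model_rollback()` leaves `ax == 39`, and `model_is_error()` then returns false, so the exit report reads as a successful call that returned 39.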
And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
will "silently" exit with error_code == SIGSYS after the bogus report.
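
The SA_IMMUTABLE check mentioned above can be modelled in user space roughly as follows. This is a sketch under assumptions: `model_reported_to_tracer()` is a hypothetical helper, and `MODEL_SA_IMMUTABLE` is a placeholder value for a kernel-internal flag that is not exposed in uapi headers. The condition mirrors the get_signal() context quoted later in this thread.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Placeholder for the kernel-internal SA_IMMUTABLE bit (assumed value). */
#define MODEL_SA_IMMUTABLE 0x00800000UL

/*
 * A dequeued signal is handed to the tracer only if the task is traced,
 * the signal is not SIGKILL, and its action is not marked immutable.
 */
static bool model_reported_to_tracer(bool traced, int signr,
				     unsigned long sa_flags)
{
	assert(signr > 0);
	return traced && signr != SIGKILL &&
	       !(sa_flags & MODEL_SA_IMMUTABLE);
}
```

In this model a forced SIGSYS with the immutable bit set is never reported, which matches the "silent" exit described in the commit message; the same SIGSYS without the bit (the SECCOMP_RET_TRAP case discussed later) is reported.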
Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
reports if the tracee is SECCOMP_MODE_DEAD.
TODO: With or without this change, get_signal() -> ptrace_signal() may
report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
too and prioritize the fatal SIGSYS.
Reported-by: Max Ver <dudududumaxver@gmail.com>
Closes: https://lore.kernel.org/all/CABjJbFJO+p3jA1r0gjUZrCepQb1Fab3kqxYhc_PSfoqo21ypeQ@mail.gmail.com/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/entry-common.h | 3 +++
include/linux/seccomp.h | 8 ++++++++
kernel/seccomp.c | 3 ---
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f83ca0abf2cd..5c62bda9dcf9 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -250,6 +250,9 @@ static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned lon
if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
trace_syscall_exit(regs, syscall_get_return_value(current, regs));
+ if (killed_by_seccomp(current))
+ return;
+
step = report_single_step(work);
if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
arch_ptrace_report_syscall_exit(regs, step);
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a..e95a251955c1 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -22,6 +22,12 @@
#include <linux/atomic.h>
#include <asm/seccomp.h>
+/* Not exposed in uapi headers: internal use only. */
+#define SECCOMP_MODE_DEAD (SECCOMP_MODE_FILTER + 1)
+
+#define killed_by_seccomp(task) \
+ ((task)->seccomp.mode == SECCOMP_MODE_DEAD)
+
extern int __secure_computing(void);
#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
@@ -49,6 +55,8 @@ static inline int seccomp_mode(struct seccomp *s)
struct seccomp_data;
+#define killed_by_seccomp(task) 0
+
#ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
static inline int secure_computing(void) { return 0; }
#else
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 066909393c38..461eb15c66c3 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -31,9 +31,6 @@
#include <asm/syscall.h>
-/* Not exposed in headers: strictly internal use only. */
-#define SECCOMP_MODE_DEAD (SECCOMP_MODE_FILTER + 1)
-
#ifdef CONFIG_SECCOMP_FILTER
#include <linux/file.h>
#include <linux/filter.h>
--
2.52.0
On Sun, 22 Mar 2026 14:44:54 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
> __seccomp_filter() does
>
>     case SECCOMP_RET_KILL_THREAD:
>     case SECCOMP_RET_KILL_PROCESS:
>         ...
>         /* Show the original registers in the dump. */
>         syscall_rollback(current, current_pt_regs());
>
>         /* Trigger a coredump with SIGSYS */
>         force_sig_seccomp(this_syscall, data, true);
>
> syscall_rollback() does regs->ax == orig_ax. This means that
> ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
> it looks as if the aborted syscall actually succeeded and returned its
> own syscall number.
>
> And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
> be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
> will "silently" exit with error_code == SIGSYS after the bogus report.
>
> Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
> reports if the tracee is SECCOMP_MODE_DEAD.
>
> TODO: With or without this change, get_signal() -> ptrace_signal() may
> report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
> too and prioritize the fatal SIGSYS.

AI review has questions:
https://sashiko.dev/#/patchset/ab_yVqQ7WW3flal3@redhat.com
On 03/22, Andrew Morton wrote:
>
> On Sun, 22 Mar 2026 14:44:54 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > __seccomp_filter() does
> >
> >     case SECCOMP_RET_KILL_THREAD:
> >     case SECCOMP_RET_KILL_PROCESS:
> >         ...
> >         /* Show the original registers in the dump. */
> >         syscall_rollback(current, current_pt_regs());
> >
> >         /* Trigger a coredump with SIGSYS */
> >         force_sig_seccomp(this_syscall, data, true);
> >
> > syscall_rollback() does regs->ax == orig_ax. This means that
> > ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
> > it looks as if the aborted syscall actually succeeded and returned its
> > own syscall number.
> >
> > And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
> > be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
> > will "silently" exit with error_code == SIGSYS after the bogus report.
> >
> > Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
> > reports if the tracee is SECCOMP_MODE_DEAD.
> >
> > TODO: With or without this change, get_signal() -> ptrace_signal() may
> > report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> > Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
> > too and prioritize the fatal SIGSYS.
>
> AI review has questions:
> https://sashiko.dev/#/patchset/ab_yVqQ7WW3flal3@redhat.com

Excellent question ;) Thanks sashiko!

I will have this in mind when (if) I send V2. So far my main concern is
the behavioral change caused by my RFC, I will wait for more comments
before that.

In any case: yes! I have missed another syscall_rollback() on
SECCOMP_RET_TRAP in __seccomp_filter(). In this case force_sig_seccomp()
uses force_coredump == false, so SIGSYS will be reported. But this doesn't
really make a difference wrt ptrace confusion.

Thanks!

Oleg.
On March 22, 2026 6:44:54 AM PDT, Oleg Nesterov <oleg@redhat.com> wrote:
>__seccomp_filter() does
>
> case SECCOMP_RET_KILL_THREAD:
> case SECCOMP_RET_KILL_PROCESS:
> ...
> /* Show the original registers in the dump. */
> syscall_rollback(current, current_pt_regs());
>
> /* Trigger a coredump with SIGSYS */
> force_sig_seccomp(this_syscall, data, true);
>
>syscall_rollback() does regs->ax == orig_ax. This means that
>ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
>it looks as if the aborted syscall actually succeeded and returned its
>own syscall number.
>
>And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
>be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
>will "silently" exit with error_code == SIGSYS after the bogus report.
>
>Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
>reports if the tracee is SECCOMP_MODE_DEAD.
>
>TODO: With or without this change, get_signal() -> ptrace_signal() may
>report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
>Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
>too and prioritize the fatal SIGSYS.
>
>Reported-by: Max Ver <dudududumaxver@gmail.com>
>Closes: https://lore.kernel.org/all/CABjJbFJO+p3jA1r0gjUZrCepQb1Fab3kqxYhc_PSfoqo21ypeQ@mail.gmail.com/
>Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>---
> include/linux/entry-common.h | 3 +++
> include/linux/seccomp.h | 8 ++++++++
> kernel/seccomp.c | 3 ---
> 3 files changed, 11 insertions(+), 3 deletions(-)
>
>diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>index f83ca0abf2cd..5c62bda9dcf9 100644
>--- a/include/linux/entry-common.h
>+++ b/include/linux/entry-common.h
>@@ -250,6 +250,9 @@ static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned lon
> if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
> trace_syscall_exit(regs, syscall_get_return_value(current, regs));
>
>+ if (killed_by_seccomp(current))
>+ return;
Hmm. I'm still not convinced this is right, but if we make this change, I'd want to see a behavioral test added (likely to the seccomp self tests), and to make sure the rr test suite doesn't regress. It's traditionally been the most sensitive to these kinds of notification ordering/behavior changes.
-Kees
>+
> step = report_single_step(work);
> if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
> arch_ptrace_report_syscall_exit(regs, step);
>diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>index 9b959972bf4a..e95a251955c1 100644
>--- a/include/linux/seccomp.h
>+++ b/include/linux/seccomp.h
>@@ -22,6 +22,12 @@
> #include <linux/atomic.h>
> #include <asm/seccomp.h>
>
>+/* Not exposed in uapi headers: internal use only. */
>+#define SECCOMP_MODE_DEAD (SECCOMP_MODE_FILTER + 1)
>+
>+#define killed_by_seccomp(task) \
>+ ((task)->seccomp.mode == SECCOMP_MODE_DEAD)
>+
> extern int __secure_computing(void);
>
> #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
>@@ -49,6 +55,8 @@ static inline int seccomp_mode(struct seccomp *s)
>
> struct seccomp_data;
>
>+#define killed_by_seccomp(task) 0
>+
> #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> static inline int secure_computing(void) { return 0; }
> #else
>diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>index 066909393c38..461eb15c66c3 100644
>--- a/kernel/seccomp.c
>+++ b/kernel/seccomp.c
>@@ -31,9 +31,6 @@
>
> #include <asm/syscall.h>
>
>-/* Not exposed in headers: strictly internal use only. */
>-#define SECCOMP_MODE_DEAD (SECCOMP_MODE_FILTER + 1)
>-
> #ifdef CONFIG_SECCOMP_FILTER
> #include <linux/file.h>
> #include <linux/filter.h>
--
Kees Cook
On 03/22, Kees Cook wrote:
>
> On March 22, 2026 6:44:54 AM PDT, Oleg Nesterov <oleg@redhat.com> wrote:
> >__seccomp_filter() does
> >
> >    case SECCOMP_RET_KILL_THREAD:
> >    case SECCOMP_RET_KILL_PROCESS:
> >        ...
> >        /* Show the original registers in the dump. */
> >        syscall_rollback(current, current_pt_regs());
> >
> >        /* Trigger a coredump with SIGSYS */
> >        force_sig_seccomp(this_syscall, data, true);
> >
> >syscall_rollback() does regs->ax == orig_ax. This means that
> >ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
> >it looks as if the aborted syscall actually succeeded and returned its
> >own syscall number.
> >
> >And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
> >be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
> >will "silently" exit with error_code == SIGSYS after the bogus report.
> >
> >Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
> >reports if the tracee is SECCOMP_MODE_DEAD.
> >
> >TODO: With or without this change, get_signal() -> ptrace_signal() may
> >report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> >Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
> >too and prioritize the fatal SIGSYS.
> >
> >Reported-by: Max Ver <dudududumaxver@gmail.com>
> >Closes: https://lore.kernel.org/all/CABjJbFJO+p3jA1r0gjUZrCepQb1Fab3kqxYhc_PSfoqo21ypeQ@mail.gmail.com/
> >Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> >---
> > include/linux/entry-common.h | 3 +++
> > include/linux/seccomp.h | 8 ++++++++
> > kernel/seccomp.c | 3 ---
> > 3 files changed, 11 insertions(+), 3 deletions(-)
> >
> >diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> >index f83ca0abf2cd..5c62bda9dcf9 100644
> >--- a/include/linux/entry-common.h
> >+++ b/include/linux/entry-common.h
> >@@ -250,6 +250,9 @@ static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned lon
> > if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
> > trace_syscall_exit(regs, syscall_get_return_value(current, regs));
> >
> >+ if (killed_by_seccomp(current))
> >+ return;
>
> Hmm. I'm still not convinced this is right,

Me too actually ;)

That is why RFC. So:

    - Do you agree that the current behaviour is not really "sane" and
      can confuse ptracers?

    - If yes, what else do you think we can do? No, I no longer think it
      makes sense to change the ptrace_get_syscall_info_exit() paths...

> but if we make this change, I'd want to see a behavioral test added
> (likely to the seccomp self tests), and to make sure the rr test suite doesn't regress.

OK. I'll try to take a look at these tests and possibly add another one.
But (sorry) not the next week, I will be travelling.

Oleg.
On 03/22, Oleg Nesterov wrote:
>
> On 03/22, Kees Cook wrote:
> >
> > Hmm. I'm still not convinced this is right,
>
> Me too actually ;)
>
> That is why RFC. So:
>
> - Do you agree that the current behaviour is not really "sane" and
> can confuse ptracers?
>
> - If yes, what else do you think we can do? No, I no longer think it
> makes sense to change the ptrace_get_syscall_info_exit() paths...
Perhaps _something_ like the change below makes more sense?
Oleg.
--- x/kernel/seccomp.c
+++ x/kernel/seccomp.c
@@ -1357,8 +1357,8 @@ static int __seccomp_filter(int this_sys
/* Dump core only if this is the last remaining thread. */
if (action != SECCOMP_RET_KILL_THREAD ||
(atomic_read(&current->signal->live) == 1)) {
- /* Show the original registers in the dump. */
- syscall_rollback(current, current_pt_regs());
+ syscall_set_return_value(current, current_pt_regs(),
+ -EINTR, 0);
/* Trigger a coredump with SIGSYS */
force_sig_seccomp(this_syscall, data, true);
} else {
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2916,6 +2916,11 @@ bool get_signal(struct ksignal *ksig)
if (!signr)
break; /* will return 0 */
+
+ // incomplete and ugly, just for illustration
+ if (ksig->info.si_code == SYS_SECCOMP)
+ syscall_rollback(current, current_pt_regs());
+
if (unlikely(current->ptrace) && (signr != SIGKILL) &&
!(sighand->action[signr -1].sa.sa_flags & SA_IMMUTABLE)) {
signr = ptrace_signal(signr, &ksig->info, type);
On 23-03-2026 17:39, Oleg Nesterov wrote:
> Perhaps _something_ like the change below makes more sense?

We have been working internally on a related issue in the same
seccomp/signal area, so sharing our thoughts here in case they are useful.

This change does seem closer to the real condition than checking
SECCOMP_MODE_DEAD in syscall_exit_work().

In our analysis too, the bogus syscall-exit report appears to be a real
issue: in seccomp paths which do syscall_rollback(), e.g. the fatal kill
path and also SECCOMP_RET_TRAP, the return register no longer reflects a
valid exit result. So ptrace can observe a value that did not come from a
completed syscall.

Because of that, using SECCOMP_MODE_DEAD still feels a bit broader than the
exact condition. It couples syscall-exit suppression to a persistent seccomp
task state, while the reason to suppress reporting seems more specific to a
single syscall instance: once that syscall has been rolled back, it never
actually completed, so there is no valid exit result to report.

From that point of view, a per-syscall “aborted after rollback” condition
still feels like the more natural abstraction. It also seems worth
considering whether the same issue extends beyond ptrace syscall-exit
reporting to other exit-side observers such as audit_syscall_exit() and
trace_syscall_exit().

Also, on the TODO from the RFC:

> TODO: With or without this change, get_signal() -> ptrace_signal() may
> report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> Perhaps it makes sense to change get_signal() to check
> SECCOMP_MODE_DEAD too and prioritize the fatal SIGSYS.

While tracing the same overall issue locally, we hit another path where the
forced fatal SIGSYS could be taken off the normal delivery path before
get_signal() handled it, in our case via signalfd. There,
force_sig_seccomp(..., true) marks SIGSYS as SA_IMMUTABLE via HANDLER_EXIT,
but signalfd could still dequeue it before normal fatal delivery.

So this direction looks better than the original RFC, but for the overall
solution to be reliable, it would probably also need to ensure that a forced
fatal SA_IMMUTABLE signal is not bypassed by other signal-ordering,
delivery, or consumption paths.

Thanks
Kusaram
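
The difference between keying on the persistent mode and on a per-syscall condition can be illustrated with a small user-space model. Everything here is hypothetical: `struct model_task`, the `syscall_aborted` flag and the helper names are not existing kernel interfaces, only a sketch of the abstraction discussed in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified filter verdicts; both TRAP and KILL roll the registers back. */
enum model_action { MODEL_ALLOW, MODEL_TRAP, MODEL_KILL };

struct model_task {
	bool dead;            /* persistent; stands in for SECCOMP_MODE_DEAD */
	bool syscall_aborted; /* hypothetical per-syscall flag, reset at entry */
};

/* Model of syscall entry: apply the verdict and record what happened. */
static void model_syscall_enter(struct model_task *t, enum model_action action)
{
	assert(t);
	t->syscall_aborted = (action == MODEL_TRAP || action == MODEL_KILL);
	if (action == MODEL_KILL)
		t->dead = true;
}

/* RFC-style check: suppress the exit report only for a dead task. */
static bool model_skip_by_mode(const struct model_task *t)
{
	return t->dead;
}

/* Per-syscall check: suppress whenever this syscall was rolled back. */
static bool model_skip_by_flag(const struct model_task *t)
{
	return t->syscall_aborted;
}
```

In this model a SECCOMP_RET_TRAP-style verdict is caught only by `model_skip_by_flag()`, and the flag clears again on the next allowed syscall, whereas the mode-based check stays set for the rest of the task's life.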
Thanks Kusaram!

I was travelling, hope to send V2 this weekend. And write a more detailed
reply. Just one note for now:

On 04/03, Kusaram Devineni wrote:
>
> while tracing the same overall issue locally, we hit another path where the
> forced fatal SIGSYS could be taken off the normal delivery path before
> get_signal() handled it, in our case via signalfd. There,
> force_sig_seccomp(..., true) marks SIGSYS as SA_IMMUTABLE via HANDLER_EXIT,
> but signalfd could still dequeue it before normal fatal delivery.

How?

seccomp does force_sig_seccomp() sends the signal to current, current can't
return to usermode and call signalfd_dequeue(), get_signal() must dequeue
SIGSYS and notice SA_IMMUTABLE.

And since this signal is private, signalfd_dequeue() from another thread can't
dequeue it either.

No?

Oleg.
On 03-04-2026 21:18, Oleg Nesterov wrote:
> seccomp does force_sig_seccomp() sends the signal to current, current can't
> return to usermode and call signalfd_dequeue(), get_signal() must dequeue
> SIGSYS and notice SA_IMMUTABLE.
> And since this signal is private, signalfd_dequeue() from another thread can't
> dequeue it either.
> No?

Right Oleg, not by returning to userspace and calling signalfd_dequeue()
afterward, and not from another thread.

We identified a case when working on a syzbot bug
https://syzbot.org/bug?extid=0a4c46806941297fecb9
where the forced SIGSYS was consumed through the signalfd path from
task_work on the same task before get_signal() handled normal fatal
delivery.

The setup there had an outstanding io_uring-driven signalfd request, and
task_work_run() executed before get_signal() dequeued the fatal SIGSYS.

So the sequence was roughly:

    seccomp -> force_sig_seccomp(..., true) -> pending private SIGSYS
    get_signal() entry -> task_work_run()
    task_work/signalfd path consumes SIGSYS
    get_signal() then no longer sees it to dequeue

That allowed the task to survive 'long enough' to enter another syscall in
SECCOMP_MODE_DEAD and hit the WARN_ON_ONCE() in __secure_computing().

So your point is correct in the normal case: current cannot return to
userspace and then call signalfd_dequeue(), and another thread cannot
dequeue this private signal. The case we hit was narrower and more
specific: same-task consumption via task_work before normal fatal delivery.

For that specific path, one approach that seems to work is making signalfd
exclude SA_IMMUTABLE signals from the mask it passes to
next_signal()/dequeue_signal(), so kernel-forced fatal signals remain
pending for normal delivery via get_signal().

Kusaram
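
The mask filtering suggested in the last paragraph can be sketched in user space as below. This is only a model under assumptions: the bitmask layout, `MODEL_SA_IMMUTABLE`, and the helpers `model_sigbit()` and `model_signalfd_mask()` are hypothetical, and the real signalfd code may represent its mask differently (for example as the set of signals to skip rather than to accept).

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>

#define MODEL_NSIG 64
/* Placeholder for the kernel-internal SA_IMMUTABLE bit (assumed value). */
#define MODEL_SA_IMMUTABLE 0x00800000UL

/* Bit (n - 1) represents signal n in this model. */
static uint64_t model_sigbit(int signr)
{
	assert(signr >= 1 && signr <= MODEL_NSIG);
	return UINT64_C(1) << (signr - 1);
}

/*
 * Given the signals a signalfd reader wants and the per-signal action
 * flags, drop every signal whose action is marked immutable so that it
 * stays pending for normal fatal delivery.
 */
static uint64_t model_signalfd_mask(uint64_t wanted,
				    const unsigned long sa_flags[MODEL_NSIG])
{
	uint64_t mask = wanted;

	for (int signr = 1; signr <= MODEL_NSIG; signr++) {
		if (sa_flags[signr - 1] & MODEL_SA_IMMUTABLE)
			mask &= ~model_sigbit(signr);
	}
	return mask;
}
```

If a reader asks for both SIGSYS and SIGUSR1 while the SIGSYS action carries the immutable bit, the filtered mask contains only SIGUSR1, so the forced SIGSYS cannot be consumed on this path.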