ptrace: don't report syscall-exit if the tracee was killed by seccomp

[RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Oleg Nesterov 1 week, 5 days ago

__seccomp_filter() does

	case SECCOMP_RET_KILL_THREAD:
	case SECCOMP_RET_KILL_PROCESS:
	...
		/* Show the original registers in the dump. */
		syscall_rollback(current, current_pt_regs());

		/* Trigger a coredump with SIGSYS */
		force_sig_seccomp(this_syscall, data, true);

syscall_rollback() does regs->ax = orig_ax. This means that
ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
it looks as if the aborted syscall actually succeeded and returned its
own syscall number.

And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
will "silently" exit with error_code == SIGSYS after the bogus report.

Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
reports if the tracee is SECCOMP_MODE_DEAD.

TODO: With or without this change, get_signal() -> ptrace_signal() may
report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
too and prioritize the fatal SIGSYS.

Reported-by: Max Ver <dudududumaxver@gmail.com>
Closes: https://lore.kernel.org/all/CABjJbFJO+p3jA1r0gjUZrCepQb1Fab3kqxYhc_PSfoqo21ypeQ@mail.gmail.com/
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/entry-common.h | 3 +++
 include/linux/seccomp.h      | 8 ++++++++
 kernel/seccomp.c             | 3 ---
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f83ca0abf2cd..5c62bda9dcf9 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -250,6 +250,9 @@ static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned lon
 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
 		trace_syscall_exit(regs, syscall_get_return_value(current, regs));
 
+	if (killed_by_seccomp(current))
+		return;
+
 	step = report_single_step(work);
 	if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
 		arch_ptrace_report_syscall_exit(regs, step);
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a..e95a251955c1 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -22,6 +22,12 @@
 #include <linux/atomic.h>
 #include <asm/seccomp.h>
 
+/* Not exposed in uapi headers: internal use only. */
+#define SECCOMP_MODE_DEAD	(SECCOMP_MODE_FILTER + 1)
+
+#define killed_by_seccomp(task)	\
+	((task)->seccomp.mode == SECCOMP_MODE_DEAD)
+
 extern int __secure_computing(void);
 
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
@@ -49,6 +55,8 @@ static inline int seccomp_mode(struct seccomp *s)
 
 struct seccomp_data;
 
+#define killed_by_seccomp(task)	0
+
 #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
 static inline int secure_computing(void) { return 0; }
 #else
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 066909393c38..461eb15c66c3 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -31,9 +31,6 @@
 
 #include <asm/syscall.h>
 
-/* Not exposed in headers: strictly internal use only. */
-#define SECCOMP_MODE_DEAD	(SECCOMP_MODE_FILTER + 1)
-
 #ifdef CONFIG_SECCOMP_FILTER
 #include <linux/file.h>
 #include <linux/filter.h>
-- 
2.52.0
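
As an illustration of the bogus report described in the patch description, here is a minimal user-space sketch. It is not kernel code: every model_* name is hypothetical, the two-field register struct only stands in for the x86 pt_regs fields mentioned above, and the error test assumes the usual convention that return values in [-4095, -1] denote errors.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel API. */
#define MODEL_MAX_ERRNO 4095UL

struct model_regs {
	long ax;      /* return-value register */
	long orig_ax; /* syscall number saved at entry */
};

struct model_exit_info {
	long rval;
	bool is_error;
};

/* Models the rollback: the return register gets the syscall number back. */
static void model_syscall_rollback(struct model_regs *regs)
{
	regs->ax = regs->orig_ax;
}

/* Values in [-MODEL_MAX_ERRNO, -1] are treated as errors. */
static bool model_is_err_value(long v)
{
	return (unsigned long)v >= -MODEL_MAX_ERRNO;
}

/* Models what a syscall-exit report would derive from the registers. */
static struct model_exit_info model_report_exit(const struct model_regs *regs)
{
	struct model_exit_info info = {
		.rval = regs->ax,
		.is_error = model_is_err_value(regs->ax),
	};
	return info;
}

/* Models the seccomp kill path up to the syscall-exit report. */
static struct model_exit_info model_seccomp_kill(long syscall_nr)
{
	/* Assumption: the return register holds -ENOSYS at syscall entry. */
	struct model_regs regs = { .ax = -ENOSYS, .orig_ax = syscall_nr };

	assert(syscall_nr >= 0);
	model_syscall_rollback(&regs);
	return model_report_exit(&regs);
}
```

In this model, model_seccomp_kill(39) (39 is an arbitrary syscall number) yields is_error == false and rval == 39, which is the "succeeded and returned its own syscall number" view the tracer gets.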

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Andrew Morton 1 week, 5 days ago

On Sun, 22 Mar 2026 14:44:54 +0100 Oleg Nesterov <oleg@redhat.com> wrote:

> __seccomp_filter() does
> 
> 	case SECCOMP_RET_KILL_THREAD:
> 	case SECCOMP_RET_KILL_PROCESS:
> 	...
> 		/* Show the original registers in the dump. */
> 		syscall_rollback(current, current_pt_regs());
> 
> 		/* Trigger a coredump with SIGSYS */
> 		force_sig_seccomp(this_syscall, data, true);
> 
> syscall_rollback() does regs->ax == orig_ax. This means that
> ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
> it looks as if the aborted syscall actually succeeded and returned its
> own syscall number.
> 
> And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
> be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
> will "silently" exit with error_code == SIGSYS after the bogus report.
> 
> Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
> reports if the tracee is SECCOMP_MODE_DEAD.
> 
> TODO: With or without this change, get_signal() -> ptrace_signal() may
> report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
> too and prioritize the fatal SIGSYS.

AI review has questions:
	https://sashiko.dev/#/patchset/ab_yVqQ7WW3flal3@redhat.com

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Oleg Nesterov 1 week, 5 days ago

On 03/22, Andrew Morton wrote:
>
> On Sun, 22 Mar 2026 14:44:54 +0100 Oleg Nesterov <oleg@redhat.com> wrote:
>
> > __seccomp_filter() does
> >
> > 	case SECCOMP_RET_KILL_THREAD:
> > 	case SECCOMP_RET_KILL_PROCESS:
> > 	...
> > 		/* Show the original registers in the dump. */
> > 		syscall_rollback(current, current_pt_regs());
> >
> > 		/* Trigger a coredump with SIGSYS */
> > 		force_sig_seccomp(this_syscall, data, true);
> >
> > syscall_rollback() does regs->ax == orig_ax. This means that
> > ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
> > it looks as if the aborted syscall actually succeeded and returned its
> > own syscall number.
> >
> > And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
> > be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
> > will "silently" exit with error_code == SIGSYS after the bogus report.
> >
> > Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
> > reports if the tracee is SECCOMP_MODE_DEAD.
> >
> > TODO: With or without this change, get_signal() -> ptrace_signal() may
> > report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> > Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
> > too and prioritize the fatal SIGSYS.
>
> AI review has questions:
> 	https://sashiko.dev/#/patchset/ab_yVqQ7WW3flal3@redhat.com

Excellent question ;) Thanks sashiko!

I will have this in mind when (if) I send V2.

So far my main concern is the behavioral change caused by my RFC, I will wait
for more comments before that.

In any case: yes! I have missed another syscall_rollback() on SECCOMP_RET_TRAP in
__seccomp_filter(). In this case force_sig_seccomp() uses force_coredump == false,
so SIGSYS will be reported. But this doesn't really make a difference wrt ptrace
confusion.

Thanks!

Oleg.

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Kees Cook 1 week, 5 days ago


On March 22, 2026 6:44:54 AM PDT, Oleg Nesterov <oleg@redhat.com> wrote:
>__seccomp_filter() does
>
>	case SECCOMP_RET_KILL_THREAD:
>	case SECCOMP_RET_KILL_PROCESS:
>	...
>		/* Show the original registers in the dump. */
>		syscall_rollback(current, current_pt_regs());
>
>		/* Trigger a coredump with SIGSYS */
>		force_sig_seccomp(this_syscall, data, true);
>
>syscall_rollback() does regs->ax == orig_ax. This means that
>ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
>it looks as if the aborted syscall actually succeeded and returned its
>own syscall number.
>
>And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
>be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
>will "silently" exit with error_code == SIGSYS after the bogus report.
>
>Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
>reports if the tracee is SECCOMP_MODE_DEAD.
>
>TODO: With or without this change, get_signal() -> ptrace_signal() may
>report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
>Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
>too and prioritize the fatal SIGSYS.
>
>Reported-by: Max Ver <dudududumaxver@gmail.com>
>Closes: https://lore.kernel.org/all/CABjJbFJO+p3jA1r0gjUZrCepQb1Fab3kqxYhc_PSfoqo21ypeQ@mail.gmail.com/
>Signed-off-by: Oleg Nesterov <oleg@redhat.com>
>---
> include/linux/entry-common.h | 3 +++
> include/linux/seccomp.h      | 8 ++++++++
> kernel/seccomp.c             | 3 ---
> 3 files changed, 11 insertions(+), 3 deletions(-)
>
>diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
>index f83ca0abf2cd..5c62bda9dcf9 100644
>--- a/include/linux/entry-common.h
>+++ b/include/linux/entry-common.h
>@@ -250,6 +250,9 @@ static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned lon
> 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
> 		trace_syscall_exit(regs, syscall_get_return_value(current, regs));
> 
>+	if (killed_by_seccomp(current))
>+		return;

Hmm. I'm still not convinced this is right, but if we make this change, I'd want to see a behavioral test added (likely to the seccomp self tests), and to make sure the rr test suite doesn't regress. It's traditionally been the most sensitive to these kinds of notification ordering/behavior changes.

-Kees

>+
> 	step = report_single_step(work);
> 	if (step || work & SYSCALL_WORK_SYSCALL_TRACE)
> 		arch_ptrace_report_syscall_exit(regs, step);
>diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>index 9b959972bf4a..e95a251955c1 100644
>--- a/include/linux/seccomp.h
>+++ b/include/linux/seccomp.h
>@@ -22,6 +22,12 @@
> #include <linux/atomic.h>
> #include <asm/seccomp.h>
> 
>+/* Not exposed in uapi headers: internal use only. */
>+#define SECCOMP_MODE_DEAD	(SECCOMP_MODE_FILTER + 1)
>+
>+#define killed_by_seccomp(task)	\
>+	((task)->seccomp.mode == SECCOMP_MODE_DEAD)
>+
> extern int __secure_computing(void);
> 
> #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
>@@ -49,6 +55,8 @@ static inline int seccomp_mode(struct seccomp *s)
> 
> struct seccomp_data;
> 
>+#define killed_by_seccomp(task)	0
>+
> #ifdef CONFIG_HAVE_ARCH_SECCOMP_FILTER
> static inline int secure_computing(void) { return 0; }
> #else
>diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>index 066909393c38..461eb15c66c3 100644
>--- a/kernel/seccomp.c
>+++ b/kernel/seccomp.c
>@@ -31,9 +31,6 @@
> 
> #include <asm/syscall.h>
> 
>-/* Not exposed in headers: strictly internal use only. */
>-#define SECCOMP_MODE_DEAD	(SECCOMP_MODE_FILTER + 1)
>-
> #ifdef CONFIG_SECCOMP_FILTER
> #include <linux/file.h>
> #include <linux/filter.h>

-- 
Kees Cook

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Oleg Nesterov 1 week, 5 days ago

On 03/22, Kees Cook wrote:
>
> On March 22, 2026 6:44:54 AM PDT, Oleg Nesterov <oleg@redhat.com> wrote:
> >__seccomp_filter() does
> >
> >	case SECCOMP_RET_KILL_THREAD:
> >	case SECCOMP_RET_KILL_PROCESS:
> >	...
> >		/* Show the original registers in the dump. */
> >		syscall_rollback(current, current_pt_regs());
> >
> >		/* Trigger a coredump with SIGSYS */
> >		force_sig_seccomp(this_syscall, data, true);
> >
> >syscall_rollback() does regs->ax == orig_ax. This means that
> >ptrace_get_syscall_info_exit() will see .is_error == 0. To the tracer,
> >it looks as if the aborted syscall actually succeeded and returned its
> >own syscall number.
> >
> >And since force_sig_seccomp() uses force_coredump == true, SIGSYS won't
> >be reported (see the SA_IMMUTABLE check in get_signal()), so the tracee
> >will "silently" exit with error_code == SIGSYS after the bogus report.
> >
> >Change syscall_exit_work() to avoid the bogus single-step/syscall-exit
> >reports if the tracee is SECCOMP_MODE_DEAD.
> >
> >TODO: With or without this change, get_signal() -> ptrace_signal() may
> >report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> >Perhaps it makes sense to change get_signal() to check SECCOMP_MODE_DEAD
> >too and prioritize the fatal SIGSYS.
> >
> >Reported-by: Max Ver <dudududumaxver@gmail.com>
> >Closes: https://lore.kernel.org/all/CABjJbFJO+p3jA1r0gjUZrCepQb1Fab3kqxYhc_PSfoqo21ypeQ@mail.gmail.com/
> >Signed-off-by: Oleg Nesterov <oleg@redhat.com>
> >---
> > include/linux/entry-common.h | 3 +++
> > include/linux/seccomp.h      | 8 ++++++++
> > kernel/seccomp.c             | 3 ---
> > 3 files changed, 11 insertions(+), 3 deletions(-)
> >
> >diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> >index f83ca0abf2cd..5c62bda9dcf9 100644
> >--- a/include/linux/entry-common.h
> >+++ b/include/linux/entry-common.h
> >@@ -250,6 +250,9 @@ static __always_inline void syscall_exit_work(struct pt_regs *regs, unsigned lon
> > 	if (work & SYSCALL_WORK_SYSCALL_TRACEPOINT)
> > 		trace_syscall_exit(regs, syscall_get_return_value(current, regs));
> >
> >+	if (killed_by_seccomp(current))
> >+		return;
>
> Hmm. I'm still not convinced this is right,

Me too actually ;)

That is why RFC. So:

	- Do you agree that the current behaviour is not really "sane" and
	  can confuse ptracers?

	- If yes, what else do you think we can do? No, I no longer think it
	  makes sense to change the ptrace_get_syscall_info_exit() paths...


> but if we make this change, I'd want to see a behavioral test added
> (likely to the seccomp self tests), and to make sure the rr test suite doesn't regress.

OK. I'll try to take a look at these tests and possibly add another one.

But (sorry) not the next week, I will be travelling.

Oleg.

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Oleg Nesterov 1 week, 4 days ago

On 03/22, Oleg Nesterov wrote:
>
> On 03/22, Kees Cook wrote:
> >
> > Hmm. I'm still not convinced this is right,
>
> Me too actually ;)
>
> That is why RFC. So:
>
> 	- Do you agree that the current behaviour is not really "sane" and
> 	  can confuse ptracers?
>
> 	- If yes, what else do you think we can do? No, I no longer think it
> 	  makes sense to change the ptrace_get_syscall_info_exit() paths...

Perhaps _something_ like the change below makes more sense?

Oleg.

--- x/kernel/seccomp.c
+++ x/kernel/seccomp.c
@@ -1357,8 +1357,8 @@ static int __seccomp_filter(int this_sys
 		/* Dump core only if this is the last remaining thread. */
 		if (action != SECCOMP_RET_KILL_THREAD ||
 		    (atomic_read(&current->signal->live) == 1)) {
-			/* Show the original registers in the dump. */
-			syscall_rollback(current, current_pt_regs());
+			syscall_set_return_value(current, current_pt_regs(),
+						 -EINTR, 0);
 			/* Trigger a coredump with SIGSYS */
 			force_sig_seccomp(this_syscall, data, true);
 		} else {
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2916,6 +2916,11 @@ bool get_signal(struct ksignal *ksig)
 		if (!signr)
 			break; /* will return 0 */
 
+
+		// incomplete and ugly, just for illustration
+		if (ksig->info.si_code == SYS_SECCOMP)
+			syscall_rollback(current, current_pt_regs());
+
 		if (unlikely(current->ptrace) && (signr != SIGKILL) &&
 		    !(sighand->action[signr -1].sa.sa_flags & SA_IMMUTABLE)) {
 			signr = ptrace_signal(signr, &ksig->info, type);
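
The ordering this illustrative diff aims for can be sketched in user space as three steps. This is only a model under stated assumptions: the model_* names are hypothetical, the register struct is a stand-in for the x86 fields discussed in the thread, and the real get_signal() change is described above as incomplete.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel API. */
#define MODEL_MAX_ERRNO 4095UL

struct model_regs {
	long ax;      /* return-value register */
	long orig_ax; /* syscall number saved at entry */
};

/* Step 1 (seccomp kill path): mark the syscall as failed, no rollback yet. */
static void model_seccomp_kill(struct model_regs *regs)
{
	regs->ax = -EINTR;
}

/* Step 2 (syscall-exit stop): would a tracer see an error here? */
static bool model_exit_is_error(const struct model_regs *regs)
{
	return (unsigned long)regs->ax >= -MODEL_MAX_ERRNO;
}

/* Step 3 (SIGSYS dequeue): restore the entry registers for the core dump. */
static void model_dequeue_seccomp_sigsys(struct model_regs *regs)
{
	regs->ax = regs->orig_ax;
}
```

Calling the three steps in order on a model_regs value gives an error report (-EINTR) at the exit stop, and the original register value only afterwards, when the dump would be taken.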

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Kusaram Devineni 5 hours ago

On 23-03-2026 17:39, Oleg Nesterov wrote:
> Perhaps _something_ like the change below makes more sense?

We have been working internally on a related issue in the same 
seccomp/signal area, so sharing our thoughts here in case they are useful.

This change does seem closer to the real condition than checking
SECCOMP_MODE_DEAD in syscall_exit_work(). In our analysis too, the bogus
syscall-exit report appears to be a real issue. In seccomp paths which
do syscall_rollback(), e.g. the fatal kill path and also
SECCOMP_RET_TRAP, the return register no longer reflects a valid exit
result. So ptrace can observe a value that did not come from a completed
syscall.

Because of that, using SECCOMP_MODE_DEAD still feels a bit broader than 
the exact condition. It couples syscall-exit suppression to a persistent 
seccomp task state, while the reason to suppress reporting seems more 
specific to a single syscall instance: once that syscall has been rolled 
back, it never actually completed, so there is no valid exit result to 
report. From that point of view, a per-syscall “aborted after rollback” 
condition still feels like the more natural abstraction.
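
To make the contrast concrete, here is a minimal user-space sketch of the two conditions. It is an assumption-laden model, not a proposed implementation: the syscall_aborted flag and all model_* names are hypothetical, and nothing in the thread specifies where such a flag would live or when it would be cleared.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel API. */
enum model_seccomp_mode { MODEL_MODE_FILTER, MODEL_MODE_DEAD };

struct model_task {
	enum model_seccomp_mode mode; /* persistent once set to DEAD */
	bool syscall_aborted;         /* hypothetical per-syscall flag */
};

/* Assumption: the per-syscall flag is reset at every syscall entry. */
static void model_syscall_enter(struct model_task *t)
{
	t->syscall_aborted = false;
}

/* Models a seccomp path that rolls the syscall back (kill or trap). */
static void model_seccomp_rollback(struct model_task *t, bool fatal)
{
	t->syscall_aborted = true;
	if (fatal)
		t->mode = MODEL_MODE_DEAD;
}

/* Condition used by the original RFC: persistent task state. */
static bool model_skip_exit_by_mode(const struct model_task *t)
{
	return t->mode == MODEL_MODE_DEAD;
}

/* Per-syscall condition: only this syscall instance was rolled back. */
static bool model_skip_exit_by_flag(const struct model_task *t)
{
	return t->syscall_aborted;
}
```

In this model a non-fatal rollback (the SECCOMP_RET_TRAP case) is caught by model_skip_exit_by_flag() but not by model_skip_exit_by_mode(), and the flag no longer applies once the next syscall is entered.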

It also seems worth considering whether the same issue extends beyond 
ptrace syscall-exit reporting to other exit-side observers such as 
audit_syscall_exit() and trace_syscall_exit().

Also, on the TODO from the RFC:

> TODO: With or without this change, get_signal() -> ptrace_signal() may
> report other !SA_IMMUTABLE pending signals before it dequeues SIGSYS.
> Perhaps it makes sense to change get_signal() to check
> SECCOMP_MODE_DEAD too and prioritize the fatal SIGSYS.

While tracing the same overall issue locally, we hit another path where
the forced fatal SIGSYS could be taken off the normal delivery path 
before get_signal() handled it, in our case via signalfd. There,
force_sig_seccomp(..., true) marks SIGSYS as SA_IMMUTABLE via 
HANDLER_EXIT, but signalfd could still dequeue it before normal fatal 
delivery.

So this direction looks better than the original RFC, but for the 
overall solution to be reliable, it would probably also need to ensure 
that a forced fatal SA_IMMUTABLE signal is not bypassed by other 
signal-ordering, delivery, or consumption paths.

Thanks
Kusaram

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Oleg Nesterov 5 hours ago

Thanks Kusaram!

I was travelling, hope to send V2 this weekend. And write a more
detailed reply.

Just one note for now:

On 04/03, Kusaram Devineni wrote:
>
> while tracing the same overall issue locally, we hit another path where the
> forced fatal SIGSYS could be taken off the normal delivery path before
> get_signal() handled it, in our case via signalfd. There,
> force_sig_seccomp(..., true) marks SIGSYS as SA_IMMUTABLE via HANDLER_EXIT,
> but signalfd could still dequeue it before normal fatal delivery.

How?

seccomp does force_sig_seccomp() sends the signal to current, current can't
return to usermode and call signalfd_dequeue(), get_signal() must dequeue
SIGSYS and notice SA_IMMUTABLE.

And since this signal is private, signalfd_dequeue() from another thread can't
dequeue it either.

No?

Oleg.

Re: [RFC PATCH] ptrace: don't report syscall-exit if the tracee was killed by seccomp

Posted by Kusaram Devineni 3 hours ago

On 03-04-2026 21:18, Oleg Nesterov wrote:

> seccomp does force_sig_seccomp() sends the signal to current, current can't
> return to usermode and call signalfd_dequeue(), get_signal() must dequeue
> SIGSYS and notice SA_IMMUTABLE.
>
> And since this signal is private, signalfd_dequeue() from another thread can't
> dequeue it either.
>
> No?

Right Oleg, not by returning to userspace and calling signalfd_dequeue()
afterward, and not from another thread.

We identified a case when working on a syzbot bug
https://syzbot.org/bug?extid=0a4c46806941297fecb9 where the forced SIGSYS was
consumed through the signalfd path from task_work on the same task before
get_signal() handled normal fatal delivery. The setup there had an outstanding
io_uring-driven signalfd request, and task_work_run() executed before
get_signal() dequeued the fatal SIGSYS.

So the sequence was roughly:
     seccomp -> force_sig_seccomp(..., true) -> pending private SIGSYS
     get_signal() entry -> task_work_run()
     task_work/signalfd path consumes SIGSYS
     get_signal() then no longer sees it to dequeue

That allowed the task to survive 'long enough' to enter another syscall in
SECCOMP_MODE_DEAD and hit the WARN_ON_ONCE() in __secure_computing().

So your point is correct in the normal case: current cannot return to userspace
and then call signalfd_dequeue(), and another thread cannot dequeue this
private signal. The case we hit was narrower and more specific: same-task
consumption via task_work before normal fatal delivery.

For that specific path, one approach that seems to work is making signalfd
exclude SA_IMMUTABLE signals from the mask it passes to
next_signal()/dequeue_signal(), so kernel-forced fatal signals remain pending
for normal delivery via get_signal().
Kusaram
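
The mask filtering described in the last paragraph can be sketched in user space as follows. This is a rough model only: the model_* names, the 64-bit mask layout and the flag constant are assumptions made for illustration, and the sketch works on a "wanted signals" set rather than on whatever representation the real signalfd code passes to dequeue_signal().

```c
#include <signal.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel API. */
#define MODEL_NSIG 64
#define MODEL_SA_IMMUTABLE 0x00800000u /* stand-in flag bit for this model */

struct model_sigaction {
	uint32_t flags;
};

/* Bit for signal number signr (1-based) in a 64-bit mask. */
static uint64_t model_sigbit(int signr)
{
	return UINT64_C(1) << (signr - 1);
}

/*
 * Returns the wanted set with every signal whose action carries the
 * immutable flag removed, so such signals stay pending for the normal
 * delivery path instead of being consumed early.
 */
static uint64_t model_signalfd_mask(uint64_t wanted,
				    const struct model_sigaction actions[MODEL_NSIG])
{
	uint64_t mask = wanted;

	for (int signr = 1; signr <= MODEL_NSIG; signr++) {
		if (actions[signr - 1].flags & MODEL_SA_IMMUTABLE)
			mask &= ~model_sigbit(signr);
	}
	return mask;
}
```

With only the SIGSYS action marked immutable, a wanted set of SIGSYS plus SIGUSR1 is reduced to SIGUSR1 alone, which corresponds to the forced SIGSYS remaining pending for get_signal().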