[PATCH] seccomp: avoid the lock trip in seccomp_filter_release in common case
Posted by Mateusz Guzik 12 months ago
The vast majority of threads don't have any seccomp filters, yet the
lock taken here is shared between all threads in a given process and
is frequently used.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

Here is a splat from parallel thread creation/destruction within one
process:

bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'

[snip]
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irq+42
    seccomp_filter_release+32
    do_exit+286
    __x64_sys_exit+27
    x64_sys_call+4703
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 475601
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irq+42
    acct_collect+77
    do_exit+1380
    __x64_sys_exit+27
    x64_sys_call+4703
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 478335
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irq+42
    sigprocmask+106
    __x64_sys_rt_sigprocmask+121
    do_syscall_64+82
    entry_SYSCALL_64_after_hwframe+118
]: 1825572

There are more spots which take the same lock, with seccomp being top 3.

I could not be bothered to bench before/after, but I can do it if you
insist. That this codepath is a contention factor can be seen above.

This is a minor patch, I'm not going to insist on it.

To my reading, seccomp.filter only ever gets populated for current, so
this should be perfectly safe to test on exit without any
synchronisation.

This may need a data_race annotation if some tooling decides to protest.
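
A minimal sketch of what the annotated check might look like (hedged:
whether data_race() or READ_ONCE() is the right marking depends on how
much checking the tooling should still do on this read):

        /* Lockless fast path: no filter was ever installed. */
        if (data_race(tsk->seccomp.filter) == NULL)
                return;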

 kernel/seccomp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 7bbb408431eb..c839674966e2 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -576,6 +576,9 @@ void seccomp_filter_release(struct task_struct *tsk)
 	if (WARN_ON((tsk->flags & PF_EXITING) == 0))
 		return;
 
+	if (tsk->seccomp.filter == NULL)
+		return;
+
 	spin_lock_irq(&tsk->sighand->siglock);
 	orig = tsk->seccomp.filter;
 	/* Detach task from its filter tree. */
-- 
2.43.0
Re: [PATCH] seccomp: avoid the lock trip in seccomp_filter_release in common case
Posted by Kees Cook 12 months ago
On Mon, Feb 10, 2025 at 10:05:41PM +0100, Mateusz Guzik wrote:
> The vast majority of threads don't have any seccomp filters, yet the
> lock taken here is shared between all threads in a given process and
> is frequently used.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
> 
> Here is a splat from parallel thread creation/destruction within one
> process:
> 
> bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'
> 
> [snip]
> @[
>     __pv_queued_spin_lock_slowpath+5
>     _raw_spin_lock_irq+42
>     seccomp_filter_release+32
>     do_exit+286
>     __x64_sys_exit+27
>     x64_sys_call+4703
>     do_syscall_64+82
>     entry_SYSCALL_64_after_hwframe+118
> ]: 475601
> @[
>     __pv_queued_spin_lock_slowpath+5
>     _raw_spin_lock_irq+42
>     acct_collect+77
>     do_exit+1380
>     __x64_sys_exit+27
>     x64_sys_call+4703
>     do_syscall_64+82
>     entry_SYSCALL_64_after_hwframe+118
> ]: 478335
> @[
>     __pv_queued_spin_lock_slowpath+5
>     _raw_spin_lock_irq+42
>     sigprocmask+106
>     __x64_sys_rt_sigprocmask+121
>     do_syscall_64+82
>     entry_SYSCALL_64_after_hwframe+118
> ]: 1825572
> 
> There are more spots which take the same lock, with seccomp being top 3.
> 
> I could not be bothered to bench before/after, but I can do it if you
> insist. That this codepath is a contention factor can be seen above.
> 
> This is a minor patch, I'm not going to insist on it.
> 
> To my reading, seccomp.filter only ever gets populated for current, so
> this should be perfectly safe to test on exit without any synchronisation.

Unfortunately, this is not true. SECCOMP_FILTER_FLAG_TSYNC may operate
on non-current. See the paired smp_store_release() and smp_rmb() uses.
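
The pairing in question, roughly (paraphrased, not a verbatim quote of
kernel/seccomp.c):

        /* TSYNC writer, under sighand->siglock: publish the caller's
         * filter tree to a sibling thread. */
        smp_store_release(&thread->seccomp.filter,
                          caller->seccomp.filter);

        /* Reader side, e.g. on syscall entry: the barrier ensures the
         * mode and filter set by another thread are observed after the
         * syscall work flag; the filter pointer itself is then read
         * with READ_ONCE(). */
        smp_rmb();
        f = READ_ONCE(current->seccomp.filter);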

> This may need a data_race annotation if some tooling decides to protest.
> 
>  kernel/seccomp.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 7bbb408431eb..c839674966e2 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -576,6 +576,9 @@ void seccomp_filter_release(struct task_struct *tsk)
>  	if (WARN_ON((tsk->flags & PF_EXITING) == 0))
>  		return;
>  
> +	if (tsk->seccomp.filter == NULL)
> +		return;
> +
>  	spin_lock_irq(&tsk->sighand->siglock);
>  	orig = tsk->seccomp.filter;
>  	/* Detach task from its filter tree. */

If this can be made provably race-safe, I'd be fine with the idea, but
at present, all seccomp.filter manipulation happens under
sighand->siglock...

-Kees

-- 
Kees Cook
Re: [PATCH] seccomp: avoid the lock trip in seccomp_filter_release in common case
Posted by Mateusz Guzik 12 months ago
On Mon, Feb 10, 2025 at 11:57 PM Kees Cook <kees@kernel.org> wrote:
>
> On Mon, Feb 10, 2025 at 10:05:41PM +0100, Mateusz Guzik wrote:
> > The vast majority of threads don't have any seccomp filters, yet the
> > lock taken here is shared between all threads in a given process and
> > is frequently used.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
> >
> > Here is a splat from parallel thread creation/destruction within one
> > process:
> >
> > bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'
> >
> > [snip]
> > @[
> >     __pv_queued_spin_lock_slowpath+5
> >     _raw_spin_lock_irq+42
> >     seccomp_filter_release+32
> >     do_exit+286
> >     __x64_sys_exit+27
> >     x64_sys_call+4703
> >     do_syscall_64+82
> >     entry_SYSCALL_64_after_hwframe+118
> > ]: 475601
> > @[
> >     __pv_queued_spin_lock_slowpath+5
> >     _raw_spin_lock_irq+42
> >     acct_collect+77
> >     do_exit+1380
> >     __x64_sys_exit+27
> >     x64_sys_call+4703
> >     do_syscall_64+82
> >     entry_SYSCALL_64_after_hwframe+118
> > ]: 478335
> > @[
> >     __pv_queued_spin_lock_slowpath+5
> >     _raw_spin_lock_irq+42
> >     sigprocmask+106
> >     __x64_sys_rt_sigprocmask+121
> >     do_syscall_64+82
> >     entry_SYSCALL_64_after_hwframe+118
> > ]: 1825572
> >
> > There are more spots which take the same lock, with seccomp being top 3.
> >
> > I could not be bothered to bench before/after, but I can do it if you
> > insist. That this codepath is a contention factor can be seen above.
> >
> > This is a minor patch, I'm not going to insist on it.
> >
> > To my reading, seccomp.filter only ever gets populated for current, so
> > this should be perfectly safe to test on exit without any synchronisation.
>
> Unfortunately, this is not true. SECCOMP_FILTER_FLAG_TSYNC may operate
> on non-current. See the paired smp_store_release() and smp_rmb() uses.
>

Ok I see:

                /* Make our new filter tree visible. */
                smp_store_release(&thread->seccomp.filter,
                                  caller->seccomp.filter);
                atomic_set(&thread->seccomp.filter_count,
                           atomic_read(&caller->seccomp.filter_count));

My grep was too naive :)

> > This may need a data_race annotation if some tooling decides to protest.
> >
> >  kernel/seccomp.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> > index 7bbb408431eb..c839674966e2 100644
> > --- a/kernel/seccomp.c
> > +++ b/kernel/seccomp.c
> > @@ -576,6 +576,9 @@ void seccomp_filter_release(struct task_struct *tsk)
> >       if (WARN_ON((tsk->flags & PF_EXITING) == 0))
> >               return;
> >
> > +     if (tsk->seccomp.filter == NULL)
> > +             return;
> > +
> >       spin_lock_irq(&tsk->sighand->siglock);
> >       orig = tsk->seccomp.filter;
> >       /* Detach task from its filter tree. */
>
> If this can be made provably race-safe, I'd be fine with the idea, but
> at present, all seccomp.filter manipulation happens under
> sighand->siglock...
>

The loop in seccomp_sync_threads actively skips every thread with the
PF_EXITING flag set, while it is an invariant that any thread passed
to seccomp_filter_release has the flag set. So one only needs to
ensure visibility of the flag.
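
Roughly, the loop in question (paraphrased, irrelevant details elided):

        caller = current;
        for_each_thread(caller, thread) {
                /* Skip current, since it needs no changes. */
                if (thread == caller)
                        continue;

                /* Skip exited threads. */
                if (thread->flags & PF_EXITING)
                        continue;
                ...
        }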

The flag gets set in exit_signals().

If the target thread is the only one remaining, the sighand lock is
not taken, but then there is also nobody left to execute
seccomp_sync_threads.

If there are more threads, the flag gets set under the sighand lock,
meaning that by the time the lock is released, any other thread
executing seccomp_sync_threads will spot the flag and refrain from
messing with the exiting thread.
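
Condensed, the two exit_signals() paths (paraphrased, details elided):

        void exit_signals(struct task_struct *tsk)
        {
                if (thread_group_empty(tsk) || ...) {
                        /* Last thread: no siglock taken, but there is
                         * also nobody left to run
                         * seccomp_sync_threads(). */
                        tsk->flags |= PF_EXITING;
                        return;
                }

                spin_lock_irq(&tsk->sighand->siglock);
                /* A concurrent TSYNC serializes on the same lock, so
                 * it will observe the flag once we unlock. */
                tsk->flags |= PF_EXITING;
                ...
                spin_unlock_irq(&tsk->sighand->siglock);
        }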

I'm going to need to sleep on it, and maybe I'm missing something,
but I think this *does* work as is, despite my failed initial
assertion that only current touches the filter.

-- 
Mateusz Guzik <mjguzik gmail.com>