[PATCH v3 0/2] further damage-control lack of clone scalability
Posted by Mateusz Guzik 2 months ago
When spawning and killing threads in separate processes in parallel the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case.

Benchmark code at the end.

With this patchset alloc_pid() only takes the lock once and consequently
alleviates the problem. While scalability improves, the lock remains the
primary bottleneck by a large margin.
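
To illustrate the locking shape, the common (single-level) case goes from
roughly this (a simplified sketch loosely following alloc_pid(), not the
actual patch; set_tid handling, error paths and the deeper-level
preload/relock dance are omitted):

	idr_preload(GFP_KERNEL);
	spin_lock_irq(&pidmap_lock);
	nr = idr_alloc_cyclic(&ns->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
	spin_unlock_irq(&pidmap_lock);
	idr_preload_end();
	...
	spin_lock_irq(&pidmap_lock);	/* taken again right away */
	/* make the pid visible to find_pid_ns() */
	idr_replace(&ns->idr, pid, nr);
	ns->pid_allocated++;
	spin_unlock_irq(&pidmap_lock);

to both steps happening under a single acquire:

	idr_preload(GFP_KERNEL);
	spin_lock_irq(&pidmap_lock);
	nr = idr_alloc_cyclic(&ns->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
	/* make the pid visible to find_pid_ns() */
	idr_replace(&ns->idr, pid, nr);
	ns->pid_allocated++;
	spin_unlock_irq(&pidmap_lock);
	idr_preload_end();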

I believe idr is a poor choice for the task at hand to begin with, but
sorting that out is beyond the scope of this patchset. At the same time
any replacement would be best evaluated against a state where the
above relock problem is fixed.

Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while patched is anything between
~510k ops/s and ~560k depending on false-sharing (which I only minimally
sanitized). So this is at least 10% if you are unlucky.

bench from will-it-scale:

#include <assert.h>
#include <pthread.h>

char *testcase_description = "Thread creation and teardown";

static void *worker(void *arg)
{
        return (NULL);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
        pthread_t thread[1];
        int error;

        while (1) {
                for (int i = 0; i < 1; i++) {
                        error = pthread_create(&thread[i], NULL, worker, NULL);
                        assert(error == 0);
                }
                for (int i = 0; i < 1; i++) {
                        error = pthread_join(thread[i], NULL);
                        assert(error == 0);
                }
                (*iterations)++;
        }
}

v3:
- fix some whitespace and one typo
- slightly reword the ENOMEM comment
- move i-- in the first loop towards the end for consistency with the
  other loop
- 2 extra unlikely for initial error conditions

I retained Oleg's r-b as the changes don't affect behavior

v2:
- cosmetic fixes from Oleg
- drop idr_preload_many, relock pidmap + call idr_preload again instead
- write a commit message 

Mateusz Guzik (2):
  ns: pad refcount
  pid: only take pidmap_lock once on alloc

 include/linux/ns/ns_common_types.h |   4 +-
 kernel/pid.c                       | 134 ++++++++++++++++++-----------
 2 files changed, 89 insertions(+), 49 deletions(-)

-- 
2.48.1
Re: [PATCH v3 0/2] further damage-control lack of clone scalability
Posted by Christian Brauner 2 weeks, 5 days ago
On Sat, 06 Dec 2025 14:19:53 +0100, Mateusz Guzik wrote:
> When spawning and killing threads in separate processes in parallel the
> primary bottleneck on the stock kernel is pidmap_lock, largely because
> of a back-to-back acquire in the common case.
> 
> Benchmark code at the end.
> 
> With this patchset alloc_pid() only takes the lock once and consequently
> alleviates the problem. While scalability improves, the lock remains the
> primary bottleneck by a large margin.
> 
> [...]

Applied to the kernel-7.0.misc branch of the vfs/vfs.git tree.
Patches in the kernel-7.0.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: kernel-7.0.misc

[1/2] ns: pad refcount
      https://git.kernel.org/vfs/vfs/c/0832219c84f2
[2/2] pid: only take pidmap_lock once on alloc
      https://git.kernel.org/vfs/vfs/c/fb374f0d6fc6
Re: [PATCH v3 0/2] further damage-control lack of clone scalability
Posted by Mateusz Guzik 2 weeks, 5 days ago
On Sat, Dec 6, 2025 at 2:20 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> When spawning and killing threads in separate processes in parallel the
> primary bottleneck on the stock kernel is pidmap_lock, largely because
> of a back-to-back acquire in the common case.
>
> Benchmark code at the end.
>
> With this patchset alloc_pid() only takes the lock once and consequently
> alleviates the problem. While scalability improves, the lock remains the
> primary bottleneck by a large margin.
>
> I believe idr is a poor choice for the task at hand to begin with, but
> sorting that out is beyond the scope of this patchset. At the same time
> any replacement would be best evaluated against a state where the
> above relock problem is fixed.
>
> Performance improvement varies between reboots. When benchmarking with
> 20 processes creating and killing threads in a loop, the unpatched
> baseline hovers around 465k ops/s, while patched is anything between
> ~510k ops/s and ~560k depending on false-sharing (which I only minimally
> sanitized). So this is at least 10% if you are unlucky.
>

I had another look at the problem with an eye towards next steps after this patchset.

I found the primary remaining problem is pidfs support -- commenting it
out gives me about a 40% boost. After that, the top of the profile is
cgroups with its 3 lock acquires per thread life cycle:

@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irqsave+45
    cgroup_task_dead+33
    finish_task_switch.isra.0+555
    schedule_tail+11
    ret_from_fork+27
    ret_from_fork_asm+26
]: 2550200
@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irq+38
    cgroup_post_fork+57
    copy_process+5993
    kernel_clone+148
    __do_sys_clone3+188
    do_syscall_64+78
    entry_SYSCALL_64_after_hwframe+118
]: 3486368
@[
    __pv_queued_spin_lock_slowpath+1
    _raw_spin_lock_irq+38
    cgroup_can_fork+110
    copy_process+4940
    kernel_clone+148
    __do_sys_clone3+188
    do_syscall_64+78
    entry_SYSCALL_64_after_hwframe+118
]: 3487665

Currently the pidfs side of this is implemented with a red-black tree.

Whatever the replacement, it should be faster and have its own
non-global locking.

I don't know what's available in the kernel to use instead.
Is it rhashtable?

I would not mind whatsoever if someone else dealt with it. :-)
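
If rhashtable is the answer, the rough shape of its API looks like this
(the pidfs_entry struct and names are made up purely for illustration,
this is not actual pidfs code):

	struct pidfs_entry {
		u64 ino;		/* lookup key */
		struct rhash_head node;	/* hash table linkage */
	};

	static const struct rhashtable_params pidfs_rht_params = {
		.key_len	= sizeof(u64),
		.key_offset	= offsetof(struct pidfs_entry, ino),
		.head_offset	= offsetof(struct pidfs_entry, node),
		.automatic_shrinking = true,
	};

	static struct rhashtable pidfs_table;

	/* once, at init time */
	rhashtable_init(&pidfs_table, &pidfs_rht_params);

	/*
	 * insert/lookup/remove; buckets are locked individually and lookups
	 * are RCU-based, so there is no global lock to contend on
	 */
	rhashtable_insert_fast(&pidfs_table, &e->node, pidfs_rht_params);
	e = rhashtable_lookup_fast(&pidfs_table, &ino, pidfs_rht_params);
	rhashtable_remove_fast(&pidfs_table, &e->node, pidfs_rht_params);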

> bench from will-it-scale:
>
> #include <assert.h>
> #include <pthread.h>
>
> char *testcase_description = "Thread creation and teardown";
>
> static void *worker(void *arg)
> {
>         return (NULL);
> }
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
>         pthread_t thread[1];
>         int error;
>
>         while (1) {
>                 for (int i = 0; i < 1; i++) {
>                         error = pthread_create(&thread[i], NULL, worker, NULL);
>                         assert(error == 0);
>                 }
>                 for (int i = 0; i < 1; i++) {
>                         error = pthread_join(thread[i], NULL);
>                         assert(error == 0);
>                 }
>                 (*iterations)++;
>         }
> }
>
> v3:
> - fix some whitespace and one typo
> - slightly reword the ENOMEM comment
> - move i-- in the first loop towards the end for consistency with the
>   other loop
> - 2 extra unlikely for initial error conditions
>
> I retained Oleg's r-b as the changes don't affect behavior
>
> v2:
> - cosmetic fixes from Oleg
> - drop idr_preload_many, relock pidmap + call idr_preload again instead
> - write a commit message
>
> Mateusz Guzik (2):
>   ns: pad refcount
>   pid: only take pidmap_lock once on alloc
>
>  include/linux/ns/ns_common_types.h |   4 +-
>  kernel/pid.c                       | 134 ++++++++++++++++++-----------
>  2 files changed, 89 insertions(+), 49 deletions(-)
>
> --
> 2.48.1
>