include/linux/ns/ns_common_types.h | 4 +- kernel/pid.c | 134 ++++++++++++++++++----------- 2 files changed, 89 insertions(+), 49 deletions(-)
When spawning and killing threads in separate processes in parallel the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case.
Benchmark code at the end.
With this patchset alloc_pid() only takes the lock once and consequently
alleviates the problem. While scalability improves, the lock remains the
primary bottleneck by a large margin.
I believe idr is a poor choice for the task at hand to begin with, but
sorting out that out beyond the scope of this patchset. At the same time
any replacement would be best evaluated against a state where the
above relock problem is fixed.
Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while patched is anything between
~510k ops/s and ~560k depending on false-sharing (which I only minimally
sanitized). So this is at least 10% if you are unlucky.
bench from will-it-scale:
#include <assert.h>
#include <pthread.h>
char *testcase_description = "Thread creation and teardown";
static void *worker(void *arg)
{
return (NULL);
}
void testcase(unsigned long long *iterations, unsigned long nr)
{
pthread_t thread[1];
int error;
while (1) {
for (int i = 0; i < 1; i++) {
error = pthread_create(&thread[i], NULL, worker, NULL);
assert(error == 0);
}
for (int i = 0; i < 1; i++) {
error = pthread_join(thread[i], NULL);
assert(error == 0);
}
(*iterations)++;
}
}
v3:
- fix some whitespace and one typo
- slightly reword the ENOMEM comment
- move i-- in the first loop towards the end for consistency with the
other loop
- 2 extra unlikely for initial error conditions
I retained Oleg's r-b as the changes don't affect behavior
v2:
- cosmetic fixes from Oleg
- drop idr_preload_many, relock pidmap + call idr_preload again instead
- write a commit message
Mateusz Guzik (2):
ns: pad refcount
pid: only take pidmap_lock once on alloc
include/linux/ns/ns_common_types.h | 4 +-
kernel/pid.c | 134 ++++++++++++++++++-----------
2 files changed, 89 insertions(+), 49 deletions(-)
--
2.48.1
On Sat, 06 Dec 2025 14:19:53 +0100, Mateusz Guzik wrote:
> When spawning and killing threads in separate processes in parallel the
> primary bottleneck on the stock kernel is pidmap_lock, largely because
> of a back-to-back acquire in the common case.
>
> Benchmark code at the end.
>
> With this patchset alloc_pid() only takes the lock once and consequently
> alleviates the problem. While scalability improves, the lock remains the
> primary bottleneck by a large margin.
>
> [...]
Applied to the kernel-7.0.misc branch of the vfs/vfs.git tree.
Patches in the kernel-7.0.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: kernel-7.0.misc
[1/2] ns: pad refcount
https://git.kernel.org/vfs/vfs/c/0832219c84f2
[2/2] pid: only take pidmap_lock once on alloc
https://git.kernel.org/vfs/vfs/c/fb374f0d6fc6
On Sat, Dec 6, 2025 at 2:20 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> When spawning and killing threads in separate processes in parallel the
> primary bottleneck on the stock kernel is pidmap_lock, largely because
> of a back-to-back acquire in the common case.
>
> Benchmark code at the end.
>
> With this patchset alloc_pid() only takes the lock once and consequently
> alleviates the problem. While scalability improves, the lock remains the
> primary bottleneck by a large margin.
>
> I believe idr is a poor choice for the task at hand to begin with, but
> sorting out that out beyond the scope of this patchset. At the same time
> any replacement would be best evaluated against a state where the
> above relock problem is fixed.
>
> Performance improvement varies between reboots. When benchmarking with
> 20 processes creating and killing threads in a loop, the unpatched
> baseline hovers around 465k ops/s, while patched is anything between
> ~510k ops/s and ~560k depending on false-sharing (which I only minimally
> sanitized). So this is at least 10% if you are unlucky.
>
I had another look at the problem concerning steps after this patchset.
I found the primary problem is pidfs support -- commenting it out
gives me about 40% boost, afterwards top of the profile is cgroups
with its 3 lock acquires per thread life cycle:
@[
__pv_queued_spin_lock_slowpath+1
_raw_spin_lock_irqsave+45
cgroup_task_dead+33
finish_task_switch.isra.0+555
schedule_tail+11
ret_from_fork+27
ret_from_fork_asm+26
]: 2550200
@[
__pv_queued_spin_lock_slowpath+1
_raw_spin_lock_irq+38
cgroup_post_fork+57
copy_process+5993
kernel_clone+148
__do_sys_clone3+188
do_syscall_64+78
entry_SYSCALL_64_after_hwframe+118
]: 3486368
@[
__pv_queued_spin_lock_slowpath+1
_raw_spin_lock_irq+38
cgroup_can_fork+110
copy_process+4940
kernel_clone+148
__do_sys_clone3+188
do_syscall_64+78
entry_SYSCALL_64_after_hwframe+118
]: 3487665
currently the pidfs thing is implemented with a red black tree.
Whatever the replacement it should be faster and have its own
non-global locking.
I don't know what's available in the kernel to deal with it instead.
Is it rhashtable?
I would not mind whatsoever if someone else dealt with it. :-)
> bench from will-it-scale:
>
> #include <assert.h>
> #include <pthread.h>
>
> char *testcase_description = "Thread creation and teardown";
>
> static void *worker(void *arg)
> {
> return (NULL);
> }
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
> pthread_t thread[1];
> int error;
>
> while (1) {
> for (int i = 0; i < 1; i++) {
> error = pthread_create(&thread[i], NULL, worker, NULL);
> assert(error == 0);
> }
> for (int i = 0; i < 1; i++) {
> error = pthread_join(thread[i], NULL);
> assert(error == 0);
> }
> (*iterations)++;
> }
> }
>
> v3:
> - fix some whitespace and one typo
> - slightly reword the ENOMEM comment
> - move i-- in the first loop towards the end for consistency with the
> other loop
> - 2 extra unlikely for initial error conditions
>
> I retained Oleg's r-b as the changes don't affect behavior
>
> v2:
> - cosmetic fixes from Oleg
> - drop idr_preload_many, relock pidmap + call idr_preload again instead
> - write a commit message
>
> Mateusz Guzik (2):
> ns: pad refcount
> pid: only take pidmap_lock once on alloc
>
> include/linux/ns/ns_common_types.h | 4 +-
> kernel/pid.c | 134 ++++++++++++++++++-----------
> 2 files changed, 89 insertions(+), 49 deletions(-)
>
> --
> 2.48.1
>
© 2016 - 2026 Red Hat, Inc.