[PATCH 0/3] further damage-control lack of clone scalability
Posted by Mateusz Guzik 1 week, 1 day ago
When spawning and killing threads in separate processes in parallel the
primary bottleneck on the stock kernel is pidmap_lock, largely because
of a back-to-back acquire in the common case.

Benchmark code at the end.

With this patchset alloc_pid() only takes the lock once and consequently
alleviates the problem. While scalability improves, the lock remains the
primary bottleneck by a large margin.
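
To illustrate, the stock alloc_pid() locking pattern looks roughly like
this (heavily simplified, not the literal kernel code):

for (i = ns->level; i >= 0; i--) {
        idr_preload(GFP_KERNEL);
        spin_lock_irq(&pidmap_lock);
        nr = idr_alloc_cyclic(&tmp->idr, NULL, ...);
        spin_unlock_irq(&pidmap_lock);
        idr_preload_end();
        ...
}
...
/* and then once more to make the new pid visible */
spin_lock_irq(&pidmap_lock);
...
spin_unlock_irq(&pidmap_lock);

i.e. even a single-level namespace acquires the lock back-to-back.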

I believe idr is a poor choice for the task at hand to begin with, but
sorting that out is beyond the scope of this patchset. At the same time
any replacement would be best evaluated against a state where the
above relock problem is fixed.

Performance improvement varies between reboots. When benchmarking with
20 processes creating and killing threads in a loop, the unpatched
baseline hovers around 465k ops/s, while patched is anything between
~510k ops/s and ~560k ops/s depending on false-sharing (which I only
minimally sanitized). So this is at least 10% if you are unlucky.

bench from will-it-scale:

#include <assert.h>
#include <pthread.h>

char *testcase_description = "Thread creation and teardown";

static void *worker(void *arg)
{
        return (NULL);
}

void testcase(unsigned long long *iterations, unsigned long nr)
{
        pthread_t thread[1];
        int error;

        while (1) {
                for (int i = 0; i < 1; i++) {
                        error = pthread_create(&thread[i], NULL, worker, NULL);
                        assert(error == 0);
                }
                for (int i = 0; i < 1; i++) {
                        error = pthread_join(thread[i], NULL);
                        assert(error == 0);
                }
                (*iterations)++;
        }
}


Mateusz Guzik (3):
  idr: add idr_prealloc_many
  ns: pad refcount
  pid: only take pidmap_lock once on alloc

 include/linux/idr.h                |  1 +
 include/linux/ns/ns_common_types.h |  4 +-
 kernel/pid.c                       | 99 +++++++++++++++++-------------
 lib/radix-tree.c                   | 19 +++++-
 4 files changed, 76 insertions(+), 47 deletions(-)

-- 
2.48.1
Re: [PATCH 0/3] further damage-control lack of clone scalability
Posted by Matthew Wilcox 1 week, 1 day ago
On Sun, Nov 23, 2025 at 07:30:51AM +0100, Mateusz Guzik wrote:
> When spawning and killing threads in separate processes in parallel the
> primary bottleneck on the stock kernel is pidmap_lock, largely because
> of a back-to-back acquire in the common case.
> 
> Benchmark code at the end.
> 
> With this patchset alloc_pid() only takes the lock once and consequently
> alleviates the problem. While scalability improves, the lock remains the
> primary bottleneck by a large margin.
> 
> I believe idr is a poor choice for the task at hand to begin with, but
> sorting that out is beyond the scope of this patchset. At the same time
> any replacement would be best evaluated against a state where the
> above relock problem is fixed.

Good news!  The IDR is deprecated.  Bad news!  I'm not 100% sure that
the XArray is quite appropriate for this usecase.  I am opposed to
introducing more IDR APIs.  Have you looked at converting to the XArray?
Or do you have a better data structure in mind than the XArray?
Re: [PATCH 0/3] further damage-control lack of clone scalability
Posted by Mateusz Guzik 1 week, 1 day ago
On Sun, Nov 23, 2025 at 4:00 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Nov 23, 2025 at 07:30:51AM +0100, Mateusz Guzik wrote:
> > When spawning and killing threads in separate processes in parallel the
> > primary bottleneck on the stock kernel is pidmap_lock, largely because
> > of a back-to-back acquire in the common case.
> >
> > Benchmark code at the end.
> >
> > With this patchset alloc_pid() only takes the lock once and consequently
> > alleviates the problem. While scalability improves, the lock remains the
> > primary bottleneck by a large margin.
> >
> > I believe idr is a poor choice for the task at hand to begin with, but
> > sorting that out is beyond the scope of this patchset. At the same time
> > any replacement would be best evaluated against a state where the
> > above relock problem is fixed.
>
> Good news!  The IDR is deprecated.  Bad news!  I'm not 100% sure that
> the XArray is quite appropriate for this usecase.  I am opposed to
> introducing more IDR APIs.  Have you looked at converting to the XArray?
> Or do you have a better data structure in mind than the XArray?
>

I have some recollection we talked about this on irc a long time ago.

It is my *suspicion* this would be best served with a sparse bitmap +
a hash table.

Such a solution was already present, but it got replaced by
95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
API").

Commit message cites the following bench results:
    The following are the stats for ps, pstree and calling readdir on /proc
    for 10,000 processes.

    ps:
            With IDR API    With bitmap
    real    0m1.479s        0m2.319s
    user    0m0.070s        0m0.060s
    sys     0m0.289s        0m0.516s

    pstree:
            With IDR API    With bitmap
    real    0m1.024s        0m1.794s
    user    0m0.348s        0m0.612s
    sys     0m0.184s        0m0.264s

    proc:
            With IDR API    With bitmap
    real    0m0.059s        0m0.074s
    user    0m0.000s        0m0.004s
    sys     0m0.016s        0m0.016s

Impact on clone was not benchmarked afaics.

It may be that a hash table is a bad pick here, but the commit does
not explain what (if anything) was done to try to make it work. As is,
I suspect both the sizing of the table itself and the hashing algo
were suboptimal for the above test.

I did find a patchset converting the thing to xarray here:
https://lore.kernel.org/all/20221202171620.509140-1-bfoster@redhat.com/
. It looks like it would need a mildly annoying rebase.

The patchset does not come with any numbers for forking either and it
predates other unscrewing I did in the area (which notably moved
tasklist lock away from being the primary bottleneck). If I'm reading
it correctly it comes with lock trips per ns level. This adds overhead
from both a single-threaded and a multi-threaded standpoint, in
particular it still has multiple acquires of the same lock. If it is
fine to use the xarray and retain pidmap_lock, that's a different
story; I have not looked into it.

Regardless, in order to give whatever replacement a fair perf eval
against idr, at least the following 2 bits need to get sorted out:
- the self-induced repeat locking of pidmap_lock
- high cost of kmalloc (to my understanding waiting for sheaves4all)
Re: [PATCH 0/3] further damage-control lack of clone scalability
Posted by Matthew Wilcox 1 week, 1 day ago
On Sun, Nov 23, 2025 at 05:39:16PM +0100, Mateusz Guzik wrote:
> I have some recollection we talked about this on irc long time ago.
> 
> It is my *suspicion* this would be best served with a sparse bitmap +
> a hash table.

Maybe!  I've heard other people speculate that would be a better data
structure.  I know we switched away from a hash table for the page
cache, but that has a different usage pattern where it's common to go
from page N to page N+1, N+2, ...  Other than ps, I don't think we often
have that pattern for PIDs.

> Such a solution was already present, but it got replaced by
> 95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
> API").
> 
> Commit message cites the following bench results:
>     The following are the stats for ps, pstree and calling readdir on /proc
>     for 10,000 processes.
> 
>     ps:
>             With IDR API    With bitmap
>     real    0m1.479s        0m2.319s
>     user    0m0.070s        0m0.060s
>     sys     0m0.289s        0m0.516s
> 
>     pstree:
>             With IDR API    With bitmap
>     real    0m1.024s        0m1.794s
>     user    0m0.348s        0m0.612s
>     sys     0m0.184s        0m0.264s
> 
>     proc:
>             With IDR API    With bitmap
>     real    0m0.059s        0m0.074s
>     user    0m0.000s        0m0.004s
>     sys     0m0.016s        0m0.016s
> 
> Impact on clone was not benchmarked afaics.

It shouldn't be too much effort for you to check out 95846ecf9dac5089
and 95846ecf9dac5089^ to run your benchmark on both?  That would seem
like the cheapest way of assessing the performance of hash+bitmap
vs IDR.

> Regardless, in order to give whatever replacement a fair perf eval
> against idr, at least the following 2 bits need to get sorted out:
> - the self-induced repeat locking of pidmap_lock
> - high cost of kmalloc (to my understanding waiting for sheaves4all)

The nice thing about XArray (compared to IDR) is that there's no
requirement to preallocate.  Only 1.6% of xa_alloc() calls result in
calling slab.  The downside is that this means the XArray needs to know
where its lock is (ie xa_lock) so that it can drop the lock in order to
allocate without using GFP_ATOMIC.
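
Illustrative only (names made up, not from an actual conversion), but
the pattern is just:

DEFINE_XARRAY_ALLOC(pid_xa);

struct pid *new_pid = ...;
u32 id;
int err;

/* takes xa_lock internally and drops it to allocate nodes with
 * GFP_KERNEL instead of resorting to GFP_ATOMIC */
err = xa_alloc(&pid_xa, &id, new_pid, XA_LIMIT(1, pid_max), GFP_KERNEL);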

At one point I kind of had a plan to create a multi-xarray where you had
multiple xarrays that shared a single lock.  Or maybe this sharding is
exactly what's needed; I haven't really analysed the pid locking to see
what's needed.
Re: [PATCH 0/3] further damage-control lack of clone scalability
Posted by Mateusz Guzik 1 week, 1 day ago
On Sun, Nov 23, 2025 at 10:45 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sun, Nov 23, 2025 at 05:39:16PM +0100, Mateusz Guzik wrote:
> > I have some recollection we talked about this on irc long time ago.
> >
> > It is my *suspicion* this would be best served with a sparse bitmap +
> > a hash table.
>
> Maybe!  I've heard other people speculate that would be a better data
> structure.  I know we switched away from a hash table for the page
> cache, but that has a different usage pattern where it's common to go
> from page N to page N+1, N+2, ...  Other than ps, I don't think we often
> have that pattern for PIDs.
>

Well it does seem like the obvious choice to try, and like I said it
was even tried in Linux. It also happens to be what FreeBSD is doing
and it gets better scalability, despite other global serialization
points (that is to say both kernels *suck* on this front, it just so
happens FreeBSD sucks less).

> > Such a solution was already present, but it got replaced by
> > 95846ecf9dac5089 ("pid: replace pid bitmap implementation with IDR
> > API").
> >
> > Commit message cites the following bench results:
> >     The following are the stats for ps, pstree and calling readdir on /proc
> >     for 10,000 processes.
> >
> >     ps:
> >             With IDR API    With bitmap
> >     real    0m1.479s        0m2.319s
> >     user    0m0.070s        0m0.060s
> >     sys     0m0.289s        0m0.516s
> >
> >     pstree:
> >             With IDR API    With bitmap
> >     real    0m1.024s        0m1.794s
> >     user    0m0.348s        0m0.612s
> >     sys     0m0.184s        0m0.264s
> >
> >     proc:
> >             With IDR API    With bitmap
> >     real    0m0.059s        0m0.074s
> >     user    0m0.000s        0m0.004s
> >     sys     0m0.016s        0m0.016s
> >
> > Impact on clone was not benchmarked afaics.
>
> It shouldn't be too much effort for you to check out 95846ecf9dac5089
> and 95846ecf9dac5089^ to run your benchmark on both?  That would seem
> like the cheapest way of assessing the performance of hash+bitmap
> vs IDR.
>

The commit is 2017 vintage and that kernel has some problems I fixed
in the area. The commit is not easily revertable either.

Even then the hash as implemented at the time looks suspicious.

I think the best way forward is to slap together a PoC consisting of
the obvious vmalloc'ed bitmap for the entire pid space + hash table
and see if that has legs. If so, then one can spend time making it
memory-efficient.

> > Regardless, in order to give whatever replacement a fair perf eval
> > against idr, at least the following 2 bits need to get sorted out:
> > - the self-induced repeat locking of pidmap_lock
> > - high cost of kmalloc (to my understanding waiting for sheaves4all)
>
> The nice thing about XArray (compared to IDR) is that there's no
> requirement to preallocate.  Only 1.6% of xa_alloc() calls result in
> calling slab.

Technically there is no requirement in idr either, as it also kmallocs
with GFP_ATOMIC inside.
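
That is, the usual dance (schematically):

idr_preload(GFP_KERNEL);
spin_lock(&some_lock);
/* falls back to a GFP_ATOMIC kmalloc if the preloaded per-cpu
 * node does not cover the allocation */
id = idr_alloc(&idr, ptr, 0, 0, GFP_ATOMIC);
spin_unlock(&some_lock);
idr_preload_end();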

> At one point I kind of had a plan to create a multi-xarray where you had
> multiple xarrays that shared a single lock.  Or maybe this sharding is
> exactly what's needed; I haven't really analysed the pid locking to see
> what's needed.

So ignoring bottlenecks outside of pid management, there is the global
pidmap_lock. The xarray patchset I linked above technically speaking
provides per-namespace locking, but that does not cut it as you have
to lock all namespaces anyway, including the global one.

Ultimately in order to make this scale CPUs need to stop sharing the
locks (in the common case anyway). To that end PID space needs to get
partitioned, with ranges allocated to small sets of CPUs (say 8 or 10
or similar -- small enough for contention to not matter outside of a
microbenchmark). The ID space on Linux is enormous, so this should not
be a problem. The only potential issue is that PIDs will no longer be
sequential, which is userspace-visible (I mean, suppose a wanker knows
for a fact nobody is forking and insists on taking advantage of it (by
Hyrum's law), then his two forks in a row have predictable IDs).
Anyhow, then most real workloads should be able to allocate IDs
without depleting the range on whatever they happen to be running on
(but deallocating may still need to look at other CPU ranges).

With something like a bitmap + hash this is trivially achievable --
just have a set of bitmaps and have each assigned a 'base' which you
add/subtract on the id obtained from the given bitmap. The hash is a
non-problem.
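
Schematically (made-up names, sizing not tuned, locking elided):

/* each bitmap covers ids [base, base + RANGE_SIZE) */
struct pid_range {
        int             base;
        unsigned long   bits[BITS_TO_LONGS(RANGE_SIZE)];
};

/* alloc: find a free bit and add the base */
bit = find_first_zero_bit(range->bits, RANGE_SIZE);
__set_bit(bit, range->bits);
pid = range->base + bit;

/* free: subtract the base to get back to the bit */
__clear_bit(pid - range->base, range->bits);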

I don't know about xarray. Bottom line though, even if one was to pass
a lock around to xarray primitives but without sorting out the range
lock problem, that would merely be a stopgap.

All that said, I don't know if/when I'll get around to something like this.

I think my addition to the idr api is trivial and, absent someone
volunteering to do the needful(tm) sooner rather than later, it should
not be considered a blocker.

My worry is, should this patchset get stalled, someone is going to
swoop in and submit changes which make it infeasible to only take the
lock once.
Re: [PATCH 0/3] further damage-control lack of clone scalability
Posted by Mateusz Guzik 1 week ago
On Sun, Nov 23, 2025 at 11:33 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> Ultimately in order to make this scale CPUs need to stop sharing the
> locks (in the common case anyway). To that end PID space needs to get
> partitioned, with ranges allocated to small sets of CPUs (say 8 or 10
> or similar -- small enough for contention to not matter outside of a
> microbenchmark). The ID space on Linux is enormous, so this should not
> be a problem. The only potential issue is that PIDs will no longer be
> sequential, which is userspace-visible (I mean, suppose a wanker knows
> for a fact nobody is forking and insists on taking advantage of it (by
> Hyrum's law), then his two forks in a row have predictable IDs).
> Anyhow, then most real workloads should be able to allocate IDs
> without depleting the range on whatever they happen to be running on
> (but deallocating may still need to look at other CPU ranges).
>
> With something like a bitmap + hash this is trivially achievable --
> just have a set of bitmaps and have each assigned a 'base' which you
> add/subtract on the id obtained from the given bitmap. The hash is a
> non-problem.
>

So I had a bit of a think and I have something; it boils down to
special-casing the last level of a (quasi)radix tree. It provides the
scalability I was looking for, albeit with some uglification. What
follows is a sketch.

Note that as part of the allocation policy the kernel must postpone
pid reuse, as in you can't just free/alloc in a tiny range and call it
a day.

Part of the issue for me is that there are 32 levels of allowed
namespaces. The stock code will relock pidmap_lock for every single
one(!) on alloc; this is just terrible. Suppose one introduces
per-namespace locks; that's still 32 lock trips to grab a pid and that
still does not solve the scalability problem. For my taste that's
questionable at best, but at the same time this is what the kernel is
already doing, so let's pretend for a minute the relocks are not a
concern.

The solution is based on a (quasi)radix tree where the datum is a
pointer to a struct containing a spinlock, a bitmap and an array of
pids. It will likely be an xarray, but I'm going to keep typing radix
for the time being as this is the concept which matters.

The struct covers a carved-out id range and can fit -- say -- 250
entries or so, or whatever else fits in a size considered fine(tm).
The actual pid is obtained by adding the radix id (which serves as a
prefix) and the offset into the array.
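
Something along these lines (names made up):

#define CHUNK_IDS       250     /* or whatever is deemed fine(tm) */

struct pid_chunk {
        spinlock_t      lock;
        int             prefix;                 /* first pid covered */
        unsigned long   bits[BITS_TO_LONGS(CHUNK_IDS)];
        struct pid      *pids[CHUNK_IDS];       /* pid -> struct pid lookup */
};

/* the pid for slot i is chunk->prefix + i */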

In order to alloc a pid for a given namespace, the calling CPU would
check if it has a range carved out. If so, it locks the thing and
looks for a free pid. Absent a free pid, or a range in the first
place, it goes to the xarray to get space. This avoids synchronisation
with other
CPUs for the 250 forks (modulo a thread with an id from this range
exiting on a different cpu), which sorts out the scalability problem
in practical settings.
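
The fast path would be something like (refill from the xarray and
error handling elided):

static int chunk_alloc_pid(struct pid_chunk *chunk, struct pid *pid)
{
        int bit;

        spin_lock(&chunk->lock);
        bit = find_first_zero_bit(chunk->bits, CHUNK_IDS);
        if (bit == CHUNK_IDS) {
                spin_unlock(&chunk->lock);
                return -1;      /* depleted, go carve out a new range */
        }
        __set_bit(bit, chunk->bits);
        chunk->pids[bit] = pid;
        spin_unlock(&chunk->lock);
        return chunk->prefix + bit;
}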

Of course once someone notices that there are no IDs in use anymore
*and* the last id was handed out at some point, the range gets freed.

But you have to do the locking for every ns.

So let's say it is in fact a problem and it would be most preferred if
the CPU could take *one* lock and stick with it for all namespaces,
all while retaining scalability.

Instead, every CPU could have its own pidmap_lock. The struct with the
bitmap + pid array would have a pointer to a spinlock, which would
refer to the pidmap lock of whichever CPU allocated the range.
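
That is, the struct sketched above would grow into something like:

struct pid_chunk {
        spinlock_t      *lock;  /* the pidmap lock of the CPU which
                                 * carved out this range */
        int             prefix;
        unsigned long   bits[BITS_TO_LONGS(CHUNK_IDS)];
        struct pid      *pids[CHUNK_IDS];
};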

Et voila, allocs still get away with one lock acquire in the common
case where there are free ids in all namespaces which need to be
touched. Contention is only present on xarray locks if you ran out of
space *or* on the pidmap lock if someone is freeing an id. Hell, this
can probably be made lockless to further speed it up if need be.
However, lockless or not, the key point is that most allocs will *not*
be bouncing any cachelines.