[RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks

Posted by Gabriel Krisman Bertazi 2 months, 1 week ago
The cost of the pcpu memory allocation is non-negligible for systems
with many cpus, and it is quite visible when forking a new task, as
reported on a few occasions.  In particular, Jan Kara reported that the
commit introducing per-cpu counters for rss_stat caused a 10% regression
in system time for gitsource on his system [1].  On that same occasion,
Jan suggested we special-case the single-threaded case: since we know
there won't be frequent remote updates of rss_stat for single-threaded
applications, we could use a local counter for most updates and an
atomic counter for the infrequent remote updates.  This patchset
implements that idea.

It exposes a dual-mode counter that starts as a simple counter, cheap
to initialize for single-threaded tasks, and can be upgraded in flight
to a fully-fledged per-cpu counter later.  Patch 3 then modifies the
rss_stat counters to use that structure, forcing the upgrade as soon as
a second task sharing the mm_struct is spawned.  By delaying the
initialization cost until the MM is shared, we cover single-threaded
applications fairly cheaply while not penalizing applications that
spawn multiple threads.  On a 256c system, where the pcpu allocation of
the rss_stats is quite noticeable, this reduced the wall-clock time of
an artificial fork-intensive microbenchmark (calling /bin/true in a
loop) by 6% to 15%, depending on the number of cores.  In a more
realistic benchmark, it showed a 1.5% improvement in kernbench elapsed
time.
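
For illustration, here is a rough sketch of the dual-mode idea
(simplified, with illustrative names rather than the actual
lazy_percpu_counter API in this series):

#include <linux/percpu_counter.h>
#include <linux/atomic.h>

/*
 * Starts in a cheap single-owner mode; upgraded to a full percpu_counter
 * once the mm becomes shared (e.g. when a second thread is spawned).
 */
struct lazy_counter_sketch {
	bool			percpu_mode;	/* set once, on upgrade */
	long			local;		/* owner-task updates */
	atomic_long_t		remote;		/* rare remote updates */
	struct percpu_counter	pcpu;		/* valid once percpu_mode */
};

/* Upgrade path: pay the percpu allocation only once the mm is shared. */
static int lazy_counter_upgrade(struct lazy_counter_sketch *c)
{
	int err = percpu_counter_init(&c->pcpu,
				      c->local + atomic_long_read(&c->remote),
				      GFP_KERNEL);
	if (!err)
		WRITE_ONCE(c->percpu_mode, true);
	return err;
}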

More performance data, including profiles, is available in the patch
modifying the rss_stat counters.

While this patchset exposes a single user of the API, it should be
useful in more cases, which is why I made it into a proper API.  In
addition, considering the recent efforts in this area, such as the
hierarchical per-cpu counters (which are orthogonal to this work, since
they improve multi-threaded workloads), abstracting this behind a new
API could help merge both works.

Finally, this is an RFC because it is early work.  In particular, I'd
be interested in more benchmark suggestions, and I'd like feedback on
whether this new interface should be implemented inside percpu_counters
as lazy counters or as a completely separate interface.

Thanks,

[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3

---

Cc: linux-kernel@vger.kernel.org
Cc: jack@suse.cz
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>

Gabriel Krisman Bertazi (4):
  lib/percpu_counter: Split out a helper to insert into hotplug list
  lib: Support lazy initialization of per-cpu counters
  mm: Avoid percpu MM counters on single-threaded tasks
  mm: Split a slow path for updating mm counters

 arch/s390/mm/gmap_helpers.c         |   4 +-
 arch/s390/mm/pgtable.c              |   4 +-
 fs/exec.c                           |   2 +-
 include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++
 include/linux/mm.h                  |  26 ++---
 include/linux/mm_types.h            |   4 +-
 include/linux/percpu_counter.h      |   5 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/events/uprobes.c             |   2 +-
 kernel/fork.c                       |  14 ++-
 lib/percpu_counter.c                |  68 ++++++++++---
 mm/filemap.c                        |   2 +-
 mm/huge_memory.c                    |  22 ++---
 mm/khugepaged.c                     |   6 +-
 mm/ksm.c                            |   2 +-
 mm/madvise.c                        |   2 +-
 mm/memory.c                         |  20 ++--
 mm/migrate.c                        |   2 +-
 mm/migrate_device.c                 |   2 +-
 mm/rmap.c                           |  16 +--
 mm/swapfile.c                       |   6 +-
 mm/userfaultfd.c                    |   2 +-
 22 files changed, 276 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/lazy_percpu_counter.h

-- 
2.51.0
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mathieu Desnoyers 2 months, 1 week ago
On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> The cost of the pcpu memory allocation is non-negligible for systems
> with many cpus, and it is quite visible when forking a new task, as
> reported in a few occasions.
I've come to the same conclusion within the development of
the hierarchical per-cpu counters.

But while the mm_struct has a SLAB cache (initialized in
kernel/fork.c:mm_cache_init()), there is no such thing
for the per-mm per-cpu data.

In the mm_struct, we have the following per-cpu data (please
let me know if I missed any in the maze):

- struct mm_cid __percpu *pcpu_cid (or equivalent through
   struct mm_mm_cid after Thomas Gleixner gets his rewrite
   upstream),

- unsigned int __percpu *futex_ref,

- NR_MM_COUNTERS rss_stats per-cpu counters.

What would really reduce memory allocation overhead on fork
is to move all those fields into a top level
"struct mm_percpu_struct" as a first step. This would
merge 3 per-cpu allocations into one when forking a new
task.

Then the second step is to create a mm_percpu_struct
cache to bypass the per-cpu allocator.
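
As a rough sketch of what I mean (hypothetical layout and field names,
exact details to be determined):

/* Sketch only: one per-cpu allocation covering all per-mm per-cpu state. */
struct mm_percpu_struct {
	struct mm_cid	cid;		/* replaces mm->pcpu_cid */
	unsigned int	futex_ref;	/* replaces mm->futex_ref */
	s32		rss_stat[NR_MM_COUNTERS]; /* rss_stat per-cpu slots */
};

/* fork then does a single per-cpu allocation (hypothetical mm->pcpu field): */
/* mm->pcpu = alloc_percpu(struct mm_percpu_struct); */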

I suspect that by doing just that we'd get most of the
performance benefits provided by the single-threaded special-case
proposed here.

I'm not against special casing single-threaded if it's still
worth it after doing the underlying data structure layout/caching
changes I'm proposing here, but I think we need to fix the
memory allocation overhead issue first before working around it
with special cases and added complexity.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Jan Kara 2 months, 1 week ago
On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> > The cost of the pcpu memory allocation is non-negligible for systems
> > with many cpus, and it is quite visible when forking a new task, as
> > reported in a few occasions.
> I've come to the same conclusion within the development of
> the hierarchical per-cpu counters.
> 
> But while the mm_struct has a SLAB cache (initialized in
> kernel/fork.c:mm_cache_init()), there is no such thing
> for the per-mm per-cpu data.
> 
> In the mm_struct, we have the following per-cpu data (please
> let me know if I missed any in the maze):
> 
> - struct mm_cid __percpu *pcpu_cid (or equivalent through
>   struct mm_mm_cid after Thomas Gleixner gets his rewrite
>   upstream),
> 
> - unsigned int __percpu *futex_ref,
> 
> - NR_MM_COUNTERS rss_stats per-cpu counters.
> 
> What would really reduce memory allocation overhead on fork
> is to move all those fields into a top level
> "struct mm_percpu_struct" as a first step. This would
> merge 3 per-cpu allocations into one when forking a new
> task.
> 
> Then the second step is to create a mm_percpu_struct
> cache to bypass the per-cpu allocator.
> 
> I suspect that by doing just that we'd get most of the
> performance benefits provided by the single-threaded special-case
> proposed here.

I don't think so. Because in the profiles I have been doing for these
loads the biggest cost wasn't actually the per-cpu allocation itself but
the cost of zeroing the allocated counter for many CPUs (and then the
counter summarization on exit) and you're not going to get rid of that with
just reshuffling per-cpu fields and adding slab allocator in front.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mateusz Guzik 2 months, 1 week ago
On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > What would really reduce memory allocation overhead on fork
> > is to move all those fields into a top level
> > "struct mm_percpu_struct" as a first step. This would
> > merge 3 per-cpu allocations into one when forking a new
> > task.
> >
> > Then the second step is to create a mm_percpu_struct
> > cache to bypass the per-cpu allocator.
> >
> > I suspect that by doing just that we'd get most of the
> > performance benefits provided by the single-threaded special-case
> > proposed here.
>
> I don't think so. Because in the profiles I have been doing for these
> loads the biggest cost wasn't actually the per-cpu allocation itself but
> the cost of zeroing the allocated counter for many CPUs (and then the
> counter summarization on exit) and you're not going to get rid of that with
> just reshuffling per-cpu fields and adding slab allocator in front.
>

The entire ordeal has been discussed several times already. I'm rather
disappointed there is a new patchset posted which does not address any
of it and goes straight to special-casing single-threaded operation.

The major claims (by me anyway) are:
1. single-threaded operation for fork + exec suffers avoidable
overhead even without the rss counter problem, which is tractable
with the same kind of thing which would sort out the multi-threaded
problem
2. unfortunately there is an increasing number of multi-threaded (and
often short-lived) processes (example: lld, the linker from the llvm
project; more broadly, plenty of things in Rust, where people think
threading == performance)

Bottom line is, solutions like the one proposed in the patchset are at
best a stopgap and even they leave performance on the table for the
case they are optimizing for.

The pragmatic way forward (as I see it anyway) is to fix up the
multi-threaded thing and see if trying to special case for
single-threaded case is justifiable afterwards.

Given that the current patchset has to resort to atomics in certain
cases, there is some error-proneness and runtime overhead associated
with it going beyond merely checking if the process is single-threaded,
which puts an additional question mark on it.

Now to business:
You mentioned the rss loops are a problem. I agree, but they can be
largely damage-controlled. More importantly there are 2 loops of the
sort already happening even with the patchset at hand.

mm_alloc_cid() results in one loop in the percpu allocator to zero out
the area, then mm_init_cid() performs the following:
        for_each_possible_cpu(i) {
                struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);

                pcpu_cid->cid = MM_CID_UNSET;
                pcpu_cid->recent_cid = MM_CID_UNSET;
                pcpu_cid->time = 0;
        }

There is no way this is not visible already on 256 threads.

Preferably some magic would be done to init this on first use on a
given CPU. There is some bitmap tracking CPU presence, maybe this can
be tackled on top of it. But for the sake of argument let's say that's
too expensive or perhaps not feasible. Even then, the walk can be done
*once* by telling the percpu allocator to refrain from zeroing memory.

Which brings me to rss counters. In the current kernel that's
*another* loop over everything to zero it out. But it does not have to
be that way. Suppose the bitmap shenanigans mentioned above are a no-go
for these as well.

So instead the code could reach out to the percpu allocator to allocate
memory for both cid and rss (as mentioned by Mathieu), but have it
returned uninitialized and loop over it once, sorting out both cid and
rss in the same body. This should be drastically faster than the
current code.

But one may observe it is an invariant that the values sum up to 0 on
process exit.

So if one was to make sure that the first time this area is handed out
by the percpu allocator the values are all 0s, and then cache it
somewhere for future allocs/frees of mm, there would be no need to do
the zeroing on alloc.

On the free side summing up rss counters in check_mm() is only there
for debugging purposes. Suppose it is useful enough that it needs to
stay. Even then, as implemented right now, this is just slow for no
reason:

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                long x = percpu_counter_sum(&mm->rss_stat[i]);
[snip]
        }

That's *four* loops with extra overhead of irq-trips for every single
one. This can be patched up to only do one loop, possibly even with
irqs enabled the entire time.
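
A single-walk variant could look roughly like this (sketch only,
ignoring the locking/hotplug questions and the debug reporting):

static void check_mm_sum_once(struct mm_struct *mm, long sum[NR_MM_COUNTERS])
{
	int i, cpu;

	/* start from the central counts... */
	for (i = 0; i < NR_MM_COUNTERS; i++)
		sum[i] = percpu_counter_read(&mm->rss_stat[i]);

	/* ...then fold in all per-cpu deltas with one walk over the cpus */
	for_each_possible_cpu(cpu)
		for (i = 0; i < NR_MM_COUNTERS; i++)
			sum[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
}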

Doing the loop is still slower than not doing it, but this may be just
fast enough to obsolete ideas like the one in the proposed patchset.

While per-cpu level caching for all possible allocations seems like
the easiest way out, it in fact does *NOT* fully solve the problem --
you are still going to globally serialize in lru_gen_add_mm() (and the
del part), pgd_alloc() and other places.

Or to put it differently, per-cpu caching of mm_struct itself makes no
sense in the current kernel (with the patchset or not) because on the
way to finish the alloc or free you are going to globally serialize
several times and *that* is the issue to fix in the long run. You can
make the problematic locks fine-grained (and consequently alleviate
the scalability aspect), but you are still going to suffer the
overhead of taking them.

As far as I'm concerned the real long term solution(tm) would make the
cached mm's retain the expensive to sort out state -- list presence,
percpu memory and whatever else.

To that end I see 2 feasible approaches:
1. a dedicated allocator with coarse granularity

Instead of per-cpu, you could have an instance for every n threads
(let's say 8 or whatever). This would pose a tradeoff between total
memory usage and scalability outside of a microbenchmark setting. You
are still going to serialize in some cases, but only once on alloc and
once on free, not several times, and you are still cheaper
single-threaded. This is faster all around.

2. dtor support in the slub allocator

ctor does the hard work and dtor undoes it. There is an unfinished
patchset by Harry which implements the idea[1].

There is a serious concern about deadlock potential stemming from
running arbitrary dtor code during memory reclaim. I already described
elsewhere how, with a little bit of discipline supported by lockdep,
this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
take any locks if you hold them and you have to disable interrupts) +
mark dtors as only allowed to hold a leaf spinlock et voila, code
guaranteed to not deadlock). But then all code trying to cache its
state to be undone with the dtor has to be patched to facilitate it.
Again, bugs in the area are sorted out by lockdep.

The good news is that folks were apparently open to punting reclaim of
such memory into a workqueue, which completely alleviates that concern
anyway.

It so happens that when fork + exit is involved there are numerous
other bottlenecks which overshadow the above, but that's a rant for
another day. Here we can pretend for a minute they are solved.

[1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Gabriel Krisman Bertazi 2 months, 1 week ago
Mateusz Guzik <mjguzik@gmail.com> writes:

> On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
>> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
>> > What would really reduce memory allocation overhead on fork
>> > is to move all those fields into a top level
>> > "struct mm_percpu_struct" as a first step. This would
>> > merge 3 per-cpu allocations into one when forking a new
>> > task.
>> >
>> > Then the second step is to create a mm_percpu_struct
>> > cache to bypass the per-cpu allocator.
>> >
>> > I suspect that by doing just that we'd get most of the
>> > performance benefits provided by the single-threaded special-case
>> > proposed here.
>>
>> I don't think so. Because in the profiles I have been doing for these
>> loads the biggest cost wasn't actually the per-cpu allocation itself but
>> the cost of zeroing the allocated counter for many CPUs (and then the
>> counter summarization on exit) and you're not going to get rid of that with
>> just reshuffling per-cpu fields and adding slab allocator in front.
>>

Hi Mateusz,

> The major claims (by me anyway) are:
> 1. single-threaded operation for fork + exec suffers avoidable
> overhead even without the rss counter problem, which are tractable
> with the same kind of thing which would sort out the multi-threaded
> problem

Agreed, there are more issues in the fork/exec path than just
rss_stat.  The rss_stat performance is particularly relevant to us,
though, because it is a clear regression for single-threaded tasks,
introduced in 6.2.

I took the time to test the slab constructor approach with the
/sbin/true microbenchmark.  I've seen only a 2% gain on that tight loop
on the 80c machine, which, granted, is an artificial benchmark, but
still a good stressor of the single-threaded case.  With this patchset,
I reported a 6% improvement, getting it close to the performance before
the pcpu rss_stats introduction.  This is expected, as avoiding the
pcpu allocation and initialization altogether for the single-threaded
case, where it is not necessary, will always be better than speeding up
the allocation (even though that is a worthwhile effort in itself, as
Mathieu pointed out).

> 2. unfortunately there is an increasing number of multi-threaded (and
> often short lived) processes (example: lld, the linker form the llvm
> project; more broadly plenty of things Rust where people think
> threading == performance)

I don't agree with this argument, though.  Sure, there is an
increasing amount of multi-threaded applications, but that is not the
relevant point.  The relevant argument is the amount of single-threaded
workloads.  One example is coreutils, which are spawned to death by
scripts.  I took care to test the patchset with a full distro on my
day-to-day laptop, and I wasn't surprised to see that the vast majority
of forked tasks never create a second thread.  The ones that do are
most often long-lived applications, where the cost of mm initialization
is far less relevant to overall system performance.  Another example is
the fact that real-world benchmarks, like kernbench, can be improved by
special-casing single-threaded tasks.

> The pragmatic way forward (as I see it anyway) is to fix up the
> multi-threaded thing and see if trying to special case for
> single-threaded case is justifiable afterwards.
>
> Given that the current patchset has to resort to atomics in certain
> cases, there is some error-pronnes and runtime overhead associated
> with it going beyond merely checking if the process is
> single-threaded, which puts an additional question mark on it.

I don't get why atomics would make it error-prone.  But, regarding the
runtime overhead, please note the main point of this approach is that
the hot path can be handled with a simple non-atomic variable write in
the task context, not an atomic operation.  The latter is only used for
the infrequent case where the counter is touched by an external task
such as OOM, khugepaged, etc.
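
Roughly, the split looks like this (illustrative names only, not the
exact code in the series):

#include <linux/atomic.h>

struct st_counter {			/* single-threaded-mode state */
	long		local;		/* updated only by the owner task */
	atomic_long_t	remote;		/* OOM, khugepaged, ... */
};

static void st_counter_add(struct st_counter *c, long value, bool owner)
{
	if (owner)
		c->local += value;		    /* hot path: plain store */
	else
		atomic_long_add(value, &c->remote); /* infrequent remote path */
}

/* the true value is local + remote, folded together when summing */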

>
> Now to business:

-- 
Gabriel Krisman Bertazi
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mateusz Guzik 2 months ago
On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@suse.de> wrote:
>
> Mateusz Guzik <mjguzik@gmail.com> writes:
> > The major claims (by me anyway) are:
> > 1. single-threaded operation for fork + exec suffers avoidable
> > overhead even without the rss counter problem, which are tractable
> > with the same kind of thing which would sort out the multi-threaded
> > problem
>
> Agreed, there are more issues in the fork/exec path than just the
> rss_stat.  The rss_stat performance is particularly relevant to us,
> though, because it is a clear regression for single-threaded introduced
> in 6.2.
>
> I took the time to test the slab constructor approach with the
> /sbin/true microbenchmark.  I've seen only 2% gain on that tight loop in
> the 80c machine, which, granted, is an artificial benchmark, but still a
> good stressor of the single-threaded case.  With this patchset, I
> reported 6% improvement, getting it close to the performance before the
> pcpu rss_stats introduction. This is expected, as avoiding the pcpu
> allocation and initialization all together for the single-threaded case,
> where it is not necessary, will always be better than speeding up the
> allocation (even though that a worthwhile effort itself, as Mathieu
> pointed out)

I'm fine with the benchmark method, but it was used on a kernel which
remains gimped by the avoidably slow walk in check_mm which I already
talked about.

Per my prior commentary, this can be patched up to only do the walk
once instead of 4 times, and without taking locks.

But that's still more work than nothing and let's say that's still too
slow. 2 ideas were proposed how to avoid the walk altogether: I
proposed expanding the tlb bitmap and Mathieu went with the cid
machinery. Either way the walk over all CPUs is not there.

With the walk issue fixed and all allocations cached thanks to
ctor/dtor, even the single-threaded fork/exec will be faster than it is
with your patch thanks to *never* reaching into the per-cpu allocator
(with your patch it is still going to happen for the cid stuff).

Additionally there are other locks which can be elided later with the
ctor/dtor pair, further improving perf.

>
> > 2. unfortunately there is an increasing number of multi-threaded (and
> > often short lived) processes (example: lld, the linker form the llvm
> > project; more broadly plenty of things Rust where people think
> > threading == performance)
>
> I don't agree with this argument, though.  Sure, there is an increasing
> amount of multi-threaded applications, but this is not relevant.  The
> relevant argument is the amount of single-threaded workloads. One
> example are coreutils, which are spawned to death by scripts.  I did
> take the care of testing the patchset with a full distro on my
> day-to-day laptop and I wasn't surprised to see the vast majority of
> forked tasks never fork a second thread.  The ones that do are most
> often long-lived applications, where the cost of mm initialization is
> way less relevant to the overall system performance.  Another example is
> the fact real-world benchmarks, like kernbench, can be improved with
> special-casing single-threads.
>

I stress one more time that a full fixup for the situation as I
described above not only gets rid of the problem for *both* single-
and multi- threaded operation, but ends up with code which is faster
than your patchset even for the case you are patching for.

The multi-threaded stuff *is* very much relevant because it is
increasingly more common (see below). I did not claim that
single-threaded workloads don't matter.

I would not be arguing here if there was no feasible way to handle
both or if handling the multi-threaded case still resulted in
measurable overhead for single-threaded workloads.

Since you mention configure scripts, I'm intimately familiar with
large-scale building as a workload. While it is true that there is
rampant usage of shell, sed and whatnot (all of which are
single-threaded), things turn multi-threaded (and short-lived) very
quickly once you go past the gnu toolchain and/or c/c++ codebases.

For example the llvm linker is multi-threaded and short-lived. Since
most real programs are small, during a large scale build of different
programs you end up with tons of lld spawning and quitting all the
time.

Beyond that java, erlang, zig and others like to multi-thread as well.

Rust is an emerging ecosystem where people think adding threading
equals automatically better performance and where crate authors think
it's fine to sneak in threads (my favourite offender is the ctrlc
crate). And since Rust is growing in popularity you can expect the
kind of single-threaded tooling you see right now will turn
multi-threaded from under you over time.

> > The pragmatic way forward (as I see it anyway) is to fix up the
> > multi-threaded thing and see if trying to special case for
> > single-threaded case is justifiable afterwards.
> >
> > Given that the current patchset has to resort to atomics in certain
> > cases, there is some error-pronnes and runtime overhead associated
> > with it going beyond merely checking if the process is
> > single-threaded, which puts an additional question mark on it.
>
> I don't get why atomics would make it error-prone.  But, regarding the
> runtime overhead, please note the main point of this approach is that
> the hot path can be handled with a simple non-atomic variable write in
> the task context, and not the atomic operation. The later is only used
> for infrequent case where the counter is touched by an external task
> such as OOM, khugepaged, etc.
>

The claim is there may be a bug where something should be using the
atomic codepath but is not.
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mateusz Guzik 2 months ago
On Wed, Dec 3, 2025 at 12:02 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Dec 1, 2025 at 4:23 PM Gabriel Krisman Bertazi <krisman@suse.de> wrote:
> >
> > Mateusz Guzik <mjguzik@gmail.com> writes:
> > > The major claims (by me anyway) are:
> > > 1. single-threaded operation for fork + exec suffers avoidable
> > > overhead even without the rss counter problem, which are tractable
> > > with the same kind of thing which would sort out the multi-threaded
> > > problem
> >
> > Agreed, there are more issues in the fork/exec path than just the
> > rss_stat.  The rss_stat performance is particularly relevant to us,
> > though, because it is a clear regression for single-threaded introduced
> > in 6.2.
> >
> > I took the time to test the slab constructor approach with the
> > /sbin/true microbenchmark.  I've seen only 2% gain on that tight loop in
> > the 80c machine, which, granted, is an artificial benchmark, but still a
> > good stressor of the single-threaded case.  With this patchset, I
> > reported 6% improvement, getting it close to the performance before the
> > pcpu rss_stats introduction. This is expected, as avoiding the pcpu
> > allocation and initialization all together for the single-threaded case,
> > where it is not necessary, will always be better than speeding up the
> > allocation (even though that a worthwhile effort itself, as Mathieu
> > pointed out)
>
> I'm fine with the benchmark method, but it was used on a kernel which
> remains gimped by the avoidably slow walk in check_mm which I already
> talked about.
>
> Per my prior commentary and can be patched up to only do the walk once
> instead of 4 times, and without taking locks.
>
> But that's still more work than nothing and let's say that's still too
> slow. 2 ideas were proposed how to avoid the walk altogether: I
> proposed expanding the tlb bitmap and Mathieu went with the cid
> machinery. Either way the walk over all CPUs is not there.
>

So I got another idea and it boils down to coalescing cid init with
rss checks on exit.

I repeat that with your patchset the single-threaded case is left with
one walk on alloc (for cid stuff) and that's where issues arise for
machines with tons of cpus.

If the walk gets fixed, the same method can be used to avoid the walk
for rss, obsoleting the patchset.

So let's say it is unfixable for the time being.

mm_init_cid stores a bunch of -1 per-cpu. I'm assuming this can't be changed.

One can still handle allocation in ctor/dtor and make it an invariant
that the state present is ready to use, so in particular mm_init_cid
was already issued on it.

Then it is on the exit side to clean it up and this is where the walk
checks rss state *and* reinits cid in one loop.

Excluding the repeated lock and irq trips which don't need to be
there, I take it almost all of the overhead is cache misses. With one
loop that's sorted out.

Maybe I'm going to hack it up, but perhaps Mathieu or Harry would be
happy to do it? (or have a better idea?)

> With the walk issue fixed and all allocations cached thanks ctor/dtor,
> even the single-threaded fork/exec will be faster than it is with your
> patch thanks to *never* reaching to the per-cpu allocator (with your
> patch it is still going to happen for the cid stuff).
>
> Additionally there are other locks which can be elided later with the
> ctor/dtor pair, further improving perf.
>
> >
> > > 2. unfortunately there is an increasing number of multi-threaded (and
> > > often short lived) processes (example: lld, the linker form the llvm
> > > project; more broadly plenty of things Rust where people think
> > > threading == performance)
> >
> > I don't agree with this argument, though.  Sure, there is an increasing
> > amount of multi-threaded applications, but this is not relevant.  The
> > relevant argument is the amount of single-threaded workloads. One
> > example are coreutils, which are spawned to death by scripts.  I did
> > take the care of testing the patchset with a full distro on my
> > day-to-day laptop and I wasn't surprised to see the vast majority of
> > forked tasks never fork a second thread.  The ones that do are most
> > often long-lived applications, where the cost of mm initialization is
> > way less relevant to the overall system performance.  Another example is
> > the fact real-world benchmarks, like kernbench, can be improved with
> > special-casing single-threads.
> >
>
> I stress one more time that a full fixup for the situation as I
> described above not only gets rid of the problem for *both* single-
> and multi- threaded operation, but ends up with code which is faster
> than your patchset even for the case you are patching for.
>
> The multi-threaded stuff *is* very much relevant because it is
> increasingly more common (see below). I did not claim that
> single-threaded workloads don't matter.
>
> I would not be arguing here if there was no feasible way to handle
> both or if handling the multi-threaded case still resulted in
> measurable overhead for single-threaded workloads.
>
> Since you mention configure scripts, I'm intimately familiar with
> large-scale building as a workload. While it is true that there is
> rampant usage of shell, sed and whatnot (all of which are
> single-threaded), things turn multi-threaded (and short-lived) very
> quickly once you go past the gnu toolchain and/or c/c++ codebases.
>
> For example the llvm linker is multi-threaded and short-lived. Since
> most real programs are small, during a large scale build of different
> programs you end up with tons of lld spawning and quitting all the
> time.
>
> Beyond that java, erlang, zig and others like to multi-thread as well.
>
> Rust is an emerging ecosystem where people think adding threading
> equals automatically better performance and where crate authors think
> it's fine to sneak in threads (my favourite offender is the ctrlc
> crate). And since Rust is growing in popularity you can expect the
> kind of single-threaded tooling you see right now will turn
> multi-threaded from under you over time.
>
> > > The pragmatic way forward (as I see it anyway) is to fix up the
> > > multi-threaded thing and see if trying to special case for
> > > single-threaded case is justifiable afterwards.
> > >
> > > Given that the current patchset has to resort to atomics in certain
> > > cases, there is some error-pronnes and runtime overhead associated
> > > with it going beyond merely checking if the process is
> > > single-threaded, which puts an additional question mark on it.
> >
> > I don't get why atomics would make it error-prone.  But, regarding the
> > runtime overhead, please note the main point of this approach is that
> > the hot path can be handled with a simple non-atomic variable write in
> > the task context, and not the atomic operation. The later is only used
> > for infrequent case where the counter is touched by an external task
> > such as OOM, khugepaged, etc.
> >
>
> The claim is there may be a bug where something should be using the
> atomic codepath but is not.
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mateusz Guzik 2 months ago
On Wed, Dec 03, 2025 at 12:54:34PM +0100, Mateusz Guzik wrote:
> So I got another idea and it boils down to coalescing cid init with
> rss checks on exit.
> 
 
So the short version is I implemented a POC and I get the same
performance for single-threaded processes as your patchset when testing
on Sapphire Rapids in an 80-way vm.

Caveats:
- there is a performance bug on the cpu vs rep movsb (see https://lore.kernel.org/all/mwwusvl7jllmck64xczeka42lglmsh7mlthuvmmqlmi5stp3na@raiwozh466wz/), I worked around it like so:
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index e20e25b8b16c..1b538f7bbd89 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -189,6 +189,29 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
     endif
 endif
 
+ifdef CONFIG_CC_IS_GCC
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops of sizes known at compilation time it quickly resorts to issuing rep
+# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
+# latency and it is faster to issue regular stores (even if in loops) to handle
+# small buffers.
+#
+# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
+# reported 0.23% increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues few movs, in the
+# worst case creates a 4 * 8 store loop.
+#
+# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
+# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
+# common denominator. Someone(tm) should revisit this from time to time.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+
 #
 # If the function graph tracer is used with mcount instead of fentry,
 # '-maccumulate-outgoing-args' is needed to prevent a GCC bug

- the qemu version I'm saddled with does not pass FSRS to the guest, thus:
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index fb5a03cf5ab7..a692bb4cece4 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -30,7 +30,7 @@
  * which the compiler could/should do much better anyway.
  */
 SYM_TYPED_FUNC_START(__memset)
-       ALTERNATIVE "jmp memset_orig", "", X86_FEATURE_FSRS
+//     ALTERNATIVE "jmp memset_orig", "", X86_FEATURE_FSRS
 
        movq %rdi,%r9
        movb %sil,%al

Baseline commit (+ the 2 above hacks) is the following:
commit a8ec08bf32595ea4b109e3c7f679d4457d1c58c0
Merge: ed80cc758b78 48233291461b
Author: Vlastimil Babka <vbabka@suse.cz>
Date:   Tue Nov 25 14:38:41 2025 +0100

    Merge branch 'slab/for-6.19/mempool_alloc_bulk' into slab/for-next

This is what the ctor/dtor branch is rebased on. It is missing some of
the further changes to cid machinery in upstream, but they don't
fundamentally mess with the core idea of the patch (pcpu memory is still
allocated on mm creation and it is being zeroed) so I did not bother
rebasing -- end perf will be the same.

Benchmark is a static binary executing itself in a loop: http://apollo.backplane.com/DFlyMisc/doexec.c
    
$ cc -O2 -o static-doexec doexec.c
$ taskset --cpu-list 1 ./static-doexec 1

With ctor+dtor+unified walk I'm seeing 2% improvement over the baseline and the same performance as lazy counter.

If nobody is willing to productize this I'm going to do it.

non-production hack below for reference:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cb9c6b16c311..f952ec1f59d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1439,7 +1439,7 @@ static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
 	return (struct cpumask *)cid_bitmap;
 }
 
-static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
+static inline void mm_init_cid_percpu(struct mm_struct *mm, struct task_struct *p)
 {
 	int i;
 
@@ -1457,6 +1457,15 @@ static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
 	cpumask_clear(mm_cidmask(mm));
 }
 
+static inline void mm_init_cid(struct mm_struct *mm, struct task_struct *p)
+{
+	mm->nr_cpus_allowed = p->nr_cpus_allowed;
+	atomic_set(&mm->max_nr_cid, 0);
+	raw_spin_lock_init(&mm->cpus_allowed_lock);
+	cpumask_copy(mm_cpus_allowed(mm), &p->cpus_mask);
+	cpumask_clear(mm_cidmask(mm));
+}
+
 static inline int mm_alloc_cid_noprof(struct mm_struct *mm)
 {
 	mm->pcpu_cid = alloc_percpu_noprof(struct mm_cid);
diff --git a/kernel/fork.c b/kernel/fork.c
index a26319cddc3c..1575db9f0198 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -575,21 +575,46 @@ static inline int mm_alloc_id(struct mm_struct *mm) { return 0; }
 static inline void mm_free_id(struct mm_struct *mm) {}
 #endif /* CONFIG_MM_ID */
 
+/*
+ * pretend this is fully integrated into hotplug support
+ */
+__cacheline_aligned_in_smp DEFINE_SEQLOCK(cpu_hotplug_lock);
+
 static void check_mm(struct mm_struct *mm)
 {
-	int i;
+	long rss_stat[NR_MM_COUNTERS];
+	unsigned cpu_seq;
+	int i, cpu;
 
 	BUILD_BUG_ON_MSG(ARRAY_SIZE(resident_page_types) != NR_MM_COUNTERS,
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
-	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+	cpu_seq = read_seqbegin(&cpu_hotplug_lock);
+	local_irq_disable();
+	for (i = 0; i < NR_MM_COUNTERS; i++)
+		rss_stat[i] = mm->rss_stat[i].count;
+
+	for_each_possible_cpu(cpu) {
+		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
+
+		pcpu_cid->cid = MM_CID_UNSET;
+		pcpu_cid->recent_cid = MM_CID_UNSET;
+		pcpu_cid->time = 0;
 
-		if (unlikely(x)) {
+		for (i = 0; i < NR_MM_COUNTERS; i++)
+			rss_stat[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);
+	}
+	local_irq_enable();
+	if (read_seqretry(&cpu_hotplug_lock, cpu_seq))
+		BUG();
+
+	for (i = 0; i < NR_MM_COUNTERS; i++) {
+		if (unlikely(rss_stat[i])) {
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n",
-				 mm, resident_page_types[i], x,
+				 mm, resident_page_types[i], rss_stat[i],
 				 current->comm,
 				 task_pid_nr(current));
+			/* XXXBUG: ZERO IT OUT */
 		}
 	}
 
@@ -2953,10 +2978,19 @@ static int sighand_ctor(void *data)
 static int mm_struct_ctor(void *object)
 {
 	struct mm_struct *mm = object;
+	int cpu;
 
 	if (mm_alloc_cid(mm))
 		return -ENOMEM;
 
+	for_each_possible_cpu(cpu) {
+		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
+
+		pcpu_cid->cid = MM_CID_UNSET;
+		pcpu_cid->recent_cid = MM_CID_UNSET;
+		pcpu_cid->time = 0;
+	}
+
 	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL,
 				     NR_MM_COUNTERS)) {
 		mm_destroy_cid(mm);
diff --git a/mm/percpu.c b/mm/percpu.c
index 7d036f42b5af..47e23ea90d7b 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1693,7 +1693,7 @@ static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size)
 
 	obj_cgroup_put(objcg);
 }
-bool pcpu_charge(void *ptr, size_t size, gfp_t gfp)
+bool pcpu_charge(void __percpu *ptr, size_t size, gfp_t gfp)
 {
 	struct obj_cgroup *objcg = NULL;
 	void *addr;
@@ -1710,7 +1710,7 @@ bool pcpu_charge(void *ptr, size_t size, gfp_t gfp)
 	return true;
 }
 
-void pcpu_uncharge(void *ptr, size_t size)
+void pcpu_uncharge(void __percpu *ptr, size_t size)
 {
 	void *addr;
 	struct pcpu_chunk *chunk;
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Harry Yoo 2 months, 1 week ago
On Mon, Dec 01, 2025 at 10:23:43AM -0500, Gabriel Krisman Bertazi wrote:
> Mateusz Guzik <mjguzik@gmail.com> writes:
> 
> > On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> >> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> >> > What would really reduce memory allocation overhead on fork
> >> > is to move all those fields into a top level
> >> > "struct mm_percpu_struct" as a first step. This would
> >> > merge 3 per-cpu allocations into one when forking a new
> >> > task.
> >> >
> >> > Then the second step is to create a mm_percpu_struct
> >> > cache to bypass the per-cpu allocator.
> >> >
> >> > I suspect that by doing just that we'd get most of the
> >> > performance benefits provided by the single-threaded special-case
> >> > proposed here.
> >>
> >> I don't think so. Because in the profiles I have been doing for these
> >> loads the biggest cost wasn't actually the per-cpu allocation itself but
> >> the cost of zeroing the allocated counter for many CPUs (and then the
> >> counter summarization on exit) and you're not going to get rid of that with
> >> just reshuffling per-cpu fields and adding slab allocator in front.
> >>
> 
> Hi Mateusz,
> 
> > The major claims (by me anyway) are:
> > 1. single-threaded operation for fork + exec suffers avoidable
> > overhead even without the rss counter problem, which are tractable
> > with the same kind of thing which would sort out the multi-threaded
> > problem
> 
> Agreed, there are more issues in the fork/exec path than just the
> rss_stat.  The rss_stat performance is particularly relevant to us,
> though, because it is a clear regression for single-threaded introduced
> in 6.2.
> 
> I took the time to test the slab constructor approach with the
> /sbin/true microbenchmark.  I've seen only 2% gain on that tight loop in
> the 80c machine, which, granted, is an artificial benchmark, but still a
> good stressor of the single-threaded case.  With this patchset, I
> reported 6% improvement, getting it close to the performance before the
> pcpu rss_stats introduction.

Hi Gabriel,

I don't want to argue about which approach is better, but I just wanted
to mention that maybe this is not a fair comparison, because we can
(almost) eliminate the initialization cost with a slab ctor & dtor pair.
As Mateusz pointed out, under normal conditions we know that the sum of
each rss_stat counter is zero when it's freed.

That is what the slab constructor is for: if we know that certain fields
of a type are freed in a particular state, then we only need to
initialize them once in the constructor when the object is first
created, and no initialization is needed for subsequent allocations.

We couldn't use the slab constructor to do this because percpu memory is
not allocated when it is called, but with a ctor/dtor pair we can.
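
As a generic illustration of the constructor pattern (not the mm_struct
code; dtor support only exists in the WIP work discussed in this
thread):

#include <linux/slab.h>
#include <linux/spinlock.h>

/* Example type whose fields are always left in a known state at free time. */
struct foo {
	spinlock_t	lock;		/* always freed unlocked */
	long		balance;	/* invariant: back to 0 when freed */
};

/* Runs when the slab object is first set up, not on every allocation. */
static void foo_ctor(void *obj)
{
	struct foo *f = obj;

	spin_lock_init(&f->lock);
	f->balance = 0;
}

static struct kmem_cache *foo_cache;

static int __init foo_cache_init(void)
{
	foo_cache = kmem_cache_create("foo", sizeof(struct foo), 0,
				      SLAB_HWCACHE_ALIGN, foo_ctor);
	return foo_cache ? 0 : -ENOMEM;
}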

> This is expected, as avoiding the pcpu
> allocation and initialization all together for the single-threaded case,
> where it is not necessary, will always be better than speeding up the
> allocation (even though that a worthwhile effort itself, as Mathieu
> pointed out).

-- 
Cheers,
Harry / Hyeonggon
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Harry Yoo 2 months, 1 week ago
On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > > What would really reduce memory allocation overhead on fork
> > > is to move all those fields into a top level
> > > "struct mm_percpu_struct" as a first step. This would
> > > merge 3 per-cpu allocations into one when forking a new
> > > task.
> > >
> > > Then the second step is to create a mm_percpu_struct
> > > cache to bypass the per-cpu allocator.
> > >
> > > I suspect that by doing just that we'd get most of the
> > > performance benefits provided by the single-threaded special-case
> > > proposed here.
> >
> > I don't think so. Because in the profiles I have been doing for these
> > loads the biggest cost wasn't actually the per-cpu allocation itself but
> > the cost of zeroing the allocated counter for many CPUs (and then the
> > counter summarization on exit) and you're not going to get rid of that with
> > just reshuffling per-cpu fields and adding slab allocator in front.
> >
> 
> The entire ordeal has been discussed several times already. I'm rather
> disappointed there is a new patchset posted which does not address any
> of it and goes straight to special-casing single-threaded operation.
> 
> The major claims (by me anyway) are:
> 1. single-threaded operation for fork + exec suffers avoidable
> overhead even without the rss counter problem, which are tractable
> with the same kind of thing which would sort out the multi-threaded
> problem
> 2. unfortunately there is an increasing number of multi-threaded (and
> often short lived) processes (example: lld, the linker form the llvm
> project; more broadly plenty of things Rust where people think
> threading == performance)
> 
> Bottom line is, solutions like the one proposed in the patchset are at
> best a stopgap and even they leave performance on the table for the
> case they are optimizing for.
> 
> The pragmatic way forward (as I see it anyway) is to fix up the
> multi-threaded thing and see if trying to special case for
> single-threaded case is justifiable afterwards.
> 
> Given that the current patchset has to resort to atomics in certain
> cases, there is some error-pronnes and runtime overhead associated
> with it going beyond merely checking if the process is
> single-threaded, which puts an additional question mark on it.
> 
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
> 
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
>         for_each_possible_cpu(i) {
>                 struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
> 
>                 pcpu_cid->cid = MM_CID_UNSET;
>                 pcpu_cid->recent_cid = MM_CID_UNSET;
>                 pcpu_cid->time = 0;
>         }
> 
> There is no way this is not visible already on 256 threads.
> 
> Preferably some magic would be done to init this on first use on given
> CPU.There is some bitmap tracking CPU presence, maybe this can be
> tackled on top of it. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing memory.
> 
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
> 
> So instead the code could reach out to the percpu allocator to
> allocate memory for both cid and rss (as mentined by Mathieu), but
> have it returned uninitialized and loop over it once sorting out both
> cid and rss in the same body. This should be drastically faster than
> the current code.
> 
> But one may observe it is an invariant the values sum up to 0 on process exit.
> 
> So if one was to make sure the first time this is handed out by the
> percpu allocator the values are all 0s and then cache the area
> somewhere for future allocs/frees of mm, there would be no need to do
> the zeroing on alloc.

That's what slab constructor is for!

> On the free side summing up rss counters in check_mm() is only there
> for debugging purposes. Suppose it is useful enough that it needs to
> stay. Even then, as implemented right now, this is just slow for no
> reason:
> 
>         for (i = 0; i < NR_MM_COUNTERS; i++) {
>                 long x = percpu_counter_sum(&mm->rss_stat[i]);
> [snip]
>         }
> 
> That's *four* loops with extra overhead of irq-trips for every single
> one. This can be patched up to only do one loop, possibly even with
> irqs enabled the entire time.
> 
> Doing the loop is still slower than not doing it, but his may be just
> fast enough to obsolete the ideas like in the proposed patchset.
> 
> While per-cpu level caching for all possible allocations seems like
> the easiest way out, it in fact does *NOT* fully solve problem -- you
> are still going to globally serialize in lru_gen_add_mm() (and the del
> part), pgd_alloc() and other places.
> 
> Or to put it differently, per-cpu caching of mm_struct itself makes no
> sense in the current kernel (with the patchset or not) because on the
> way to finish the alloc or free you are going to globally serialize
> several times and *that* is the issue to fix in the long run. You can
> make the problematic locks fine-grained (and consequently alleviate
> the scalability aspect), but you are still going to suffer the
> overhead of taking them.
> 
> As far as I'm concerned the real long term solution(tm) would make the
> cached mm's retain the expensive to sort out state -- list presence,
> percpu memory and whatever else.
> 
> To that end I see 2 feasible approaches:
> 1. a dedicated allocator with coarse granularity
> 
> Instead of per-cpu, you could have an instance for every n threads
> (let's say 8 or whatever). this would pose a tradeoff between total
> memory usage and scalability outside of a microbenchmark setting. you
> are still going to serialize in some cases, but only once on alloc and
> once on free, not several times and you are still cheaper
> single-threaded. This is faster all around.
> 
> 2. dtor support in the slub allocator
> 
> ctor does the hard work and dtor undoes it. There is an unfinished
> patchset by Harry which implements the idea[1].

Apologies for not reposting it for a while. I have limited capacity to push
this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
branch after rebasing it onto the latest slab/for-next.

https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads

My review of this version is limited, but I did a little bit of testing.

> There is a serious concern about deadlock potential stemming from
> running arbitrary dtor code during memory reclaim. I already described
> elsewhere how with a little bit of discipline supported by lockdep
> this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
> take any locks if you hold them and you have to disable interrupts) +
> mark dtors as only allowed to hold a leaf spinlock et voila, code
> guaranteed to not deadlock). But then all code trying to cache its
> state in to be undone with dtor has to be patched to facilitate it.
> Again bugs in the area sorted out by lockdep.
> 
> The good news is that folks were apparently open to punting reclaim of
> such memory into a workqueue, which completely alleviates that concern
> anyway.

I took the good news and switched to using workqueue to reclaim slabs
(for caches with dtor) in v2.

> So happens if fork + exit is involved there are numerous other
> bottlenecks which overshadow the above, but that's a rant for another
> day. Here we can pretend for a minute they are solved.
> 
> [1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads

-- 
Cheers,
Harry / Hyeonggon
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mateusz Guzik 2 months, 1 week ago
On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> Apologies for not reposting it for a while. I have limited capacity to push
> this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
> branch after rebasing it onto the latest slab/for-next.
>
> https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
>

Nice, thanks. This takes care of the majority of the needful(tm).

To reiterate, should something like this land, it is going to address
the multicore scalability concern for single-threaded processes better
than the patchset by Gabriel thanks to also taking care of cid. Bonus
points for handling creation and teardown of multi-threaded processes.

However, this is still going to suffer from doing a full cpu walk on
process exit. As I described earlier the current handling can be
massively depessimized by reimplementing this to take care of all 4
counters in each iteration, instead of walking everything 4 times.
This is still going to be slower than not doing the walk at all, but
it may be fast enough that Gabriel's patchset is no longer
justifiable.

But then the test box is "only" 256 hw threads, what about bigger boxes?

Given my previous note about increased use of multithreading in
userspace, the more concerned you happen to be about such a walk, the
more you want an actual solution which takes care of multithreaded
processes.

Additionally one has to assume per-cpu memory will be useful for other
facilities down the line, making such a walk into an even bigger
problem.

Thus ultimately *some* tracking of whether a given mm was ever active
on a given cpu is needed, preferably implemented cheaply at least for
the context switch code. Per what I described in another e-mail, one
way to do it would be to coalesce it with tlb handling by changing how
the bitmap tracking is handled -- having 2 adjacent bits denote cpu
usage + tlb separately. For the common case this should be almost the
same code to set the two. Iteration for tlb shootdowns would be less
efficient but that's probably tolerable. Maybe there is a better way, I
did not put much thought into it. I just claim that sooner or later
this will need to get solved. At the same time it would be a bummer to
add stopgaps without even trying.

With the cpu tracking problem solved, check_mm would visit few cpus in
the benchmark (probably just 1) and it would be faster single-threaded
than the proposed patch *and* would retain that for processes which
went multithreaded.

I'm not signing up to handle this though and someone else would have
to sign off on the cpu tracking thing anyway.

That is to say, I laid out the lay of the land as I see it but I'm not
doing any work. :)
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mathieu Desnoyers 2 months, 1 week ago
On 2025-12-01 06:31, Mateusz Guzik wrote:
> On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>> Apologies for not reposting it for a while. I have limited capacity to push
>> this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
>> branch after rebasing it onto the latest slab/for-next.
>>
>> https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
>>
> 
> nice, thanks. This takes care of majority of the needful(tm).
> 
> To reiterate, should something like this land, it is going to address
> the multicore scalability concern for single-threaded processes better
> than the patchset by Gabriel thanks to also taking care of cid. Bonus
> points for handling creation and teardown of multi-threaded processes.
> 
> However, this is still going to suffer from doing a full cpu walk on
> process exit. As I described earlier, the current handling can be
> massively depessimized by reimplementing it to take care of all 4
> counters in each iteration instead of walking everything 4 times.
> This is still going to be slower than not doing the walk at all, but
> it may be fast enough that Gabriel's patchset is no longer
> justifiable.
> 
> But then the test box has "only" 256 hw threads; what about bigger boxes?
> 
> Given my previous note about increased use of multithreading in
> userspace, the more concerned you happen to be about such a walk, the
> more you want an actual solution which takes care of multithreaded
> processes.
> 
> Additionally one has to assume per-cpu memory will be useful for other
> facilities down the line, making such a walk into an even bigger
> problem.
> 
> Thus ultimately *some* tracking of whether a given mm was ever active
> on a given cpu is needed, preferably implemented cheaply at least for
> the context switch code. Per what I described in another e-mail, one
> way to do it would be to coalesce it with tlb handling by changing how
> the bitmap tracking is handled -- having 2 adjacent bits per cpu
> denote cpu usage + tlb separately. For the common case this should be
> almost the same code to set the two. Iteration for tlb shootdowns
> would be less efficient, but that's probably tolerable. Maybe there is
> a better way; I did not put much thought into it. I just claim sooner
> or later this will need to get solved. At the same time, it would be a
> bummer to add stopgaps without even trying.
> 
> With the cpu tracking problem solved, check_mm would visit only a few
> cpus in the benchmark (probably just 1), it would be faster for the
> single-threaded case than the proposed patch, *and* it would retain
> that advantage for processes which went multithreaded.
Looking at this problem, it appears to be a good fit for rseq mm_cid
(per-mm concurrency ids). Let me explain.

I originally implemented the rseq mm_cid for userspace. It keeps track
of max_mm_cid = min(nr_threads, nr_allowed_cpus) for each mm, and lets
the scheduler select a current mm_cid value within the range
[0 .. max_mm_cid - 1]. With Thomas Gleixner's rewrite (currently in
tip), we even have hooks in thread clone/exit where we know when
max_mm_cid is increased/decreased for a mm. So we could keep track of
the maximum value of max_mm_cid over the lifetime of a mm.

So using mm_cid for per-mm rss counter would involve:

- Still allocating memory per-cpu on mm allocation (nr_cpu_ids), but
   without zeroing all that memory (we eliminate a possible cpus walk on
   allocation).

- Initialize CPU counters on thread clone when max_mm_cid is increased.
   Keep track of the max value of max_mm_cid over mm lifetime.

- Rather than using the per-cpu accessors to access the counters, we
   would have to load the per-task mm_cid field to get the counter index.
   This adds a slight overhead on the fast path, because we would trade a
   segment-selector-prefixed operation for an access that depends on a
   load of the task struct's current mm_cid index (see the sketch after
   this list).

- Iteration on all possible cpus at process exit is replaced by an
   iteration on mm maximum max_mm_cid, which will be bound by
   the maximum value of min(nr_threads, nr_allowed_cpus) over the
   mm lifetime. This iteration should be done with the new mm_cid
   mutex held across thread clone/exit.
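
As a rough sketch of the fast path (not the actual API: it assumes the
counters stay in the existing __percpu arrays, uses current->mm_cid as
the index, and leaves out the batching and cid-reclaim corner cases a
real implementation would have to handle):

/*
 * Sketch only: bump a counter slot indexed by the task's mm_cid
 * rather than by the running cpu.
 */
static inline void mm_cid_counter_add(struct percpu_counter *fbc, s32 amount)
{
	/* Per-task index maintained under CONFIG_SCHED_MM_CID. */
	int cid = READ_ONCE(current->mm_cid);

	preempt_disable();
	/* One extra dependent load compared to this_cpu_add(). */
	*per_cpu_ptr(fbc->counters, cid) += amount;
	preempt_enable();
}

On x86 this trades the segment-prefixed this_cpu_add() for a regular
add through a pointer derived from the mm_cid load, which is the small
fast-path cost mentioned above.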

One more downside to consider is loss of NUMA locality, because the
index used to access the per-cpu memory would not take into account
the hardware topology. The index-to-topology mapping should stay
stable for a given mm, but if we mix the memory allocation of per-cpu
data across different mms, then the NUMA locality would be degraded.
Ideally we'd have a per-cpu allocator with per-mm arenas for mm_cid
indexing if we care about NUMA locality.

So let's say you have a 256-core machine where cpu numbers go from
0 to 255: with a 4-thread process, mm_cid will be limited to the
range [0..3]. Likewise, if there are tons of threads in a process
limited to a few cores (e.g. pinned on cores 10 to 19), the range
will be limited to [0..9].

This approach solves the runtime overhead issue of zeroing per-cpu
memory for all scenarios:

* single-threaded: index = 0

* nr_threads < nr_cpu_ids
   * nr_threads < nr_allowed_cpus: index = [0 .. nr_threads - 1]
   * nr_threads >= nr_allowed_cpus: index = [0 .. nr_allowed_cpus - 1]

* nr_threads >= nr_cpu_ids: index = [0 .. nr_cpu_ids - 1]

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mateusz Guzik 2 months, 1 week ago
On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
> 
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
>         for_each_possible_cpu(i) {
>                 struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
> 
>                 pcpu_cid->cid = MM_CID_UNSET;
>                 pcpu_cid->recent_cid = MM_CID_UNSET;
>                 pcpu_cid->time = 0;
>         }
> 
> There is no way this is not visible already on 256 threads.
> 
> Preferably some magic would be done to init this on first use on a
> given CPU. There is some bitmap tracking CPU presence; maybe this can be
> tackled on top of it. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing memory.
> 
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
> 

So I had another look and I think bitmapping it is perfectly feasible,
albeit requiring a little bit of refactoring to avoid adding overhead in
the common case.

There is a bitmap for tlb tracking, updated like so on context switch in
switch_mm_irqs_off():

	if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
		cpumask_set_cpu(cpu, mm_cpumask(next));

... and of course cleared at times.

The easiest way out would be to add an additional bitmap with bits
which are *never* cleared. But that's another cache miss, preferably
avoided.

Instead the entire thing could be reimplemented to have 2 bits per CPU
in the bitmap -- one for tlb and another for ever running on it.

Having spotted that the mm is running on the given cpu for the first
time, the rss area gets zeroed out and *both* bits get set, et voila.
The common case gets away with the same load as always. The less
common case gets the extra work of having to zero the counters and
initialize cid.
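
Roughly like this (untested; mm_cpu_bits() and mm_init_cpu_lazy() are
made-up helpers and ordering against remote walkers is hand-waved
away):

	/*
	 * Two bits per cpu: 2*cpu for tlb tracking (what mm_cpumask()
	 * means today), 2*cpu + 1 as the sticky "ever ran here" bit.
	 */
	if (next != &init_mm && !test_bit(2 * cpu, mm_cpu_bits(next))) {
		if (!test_bit(2 * cpu + 1, mm_cpu_bits(next))) {
			/* First time on this cpu: zero rss, init cid state. */
			mm_init_cpu_lazy(next, cpu);
			set_bit(2 * cpu + 1, mm_cpu_bits(next));
		}
		set_bit(2 * cpu, mm_cpu_bits(next));
	}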

In return both cid and rss handling can avoid mandatory linear walks by
cpu count, instead merely having to visit the cpus known to have used a
given mm.

I don't think this is particularly ugly or complicated; it just needs
some care & time to sit down and refactor all the direct bitmap
accesses into helpers.

So if I was tasked with working on the overall problem, I would
definitely try to get this done. Fortunately for me this is not the
case. :-)
Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
Posted by Mathieu Desnoyers 2 months, 1 week ago
On 2025-11-28 15:10, Jan Kara wrote:
> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
[...]
>> I suspect that by doing just that we'd get most of the
>> performance benefits provided by the single-threaded special-case
>> proposed here.
> 
> I don't think so. Because in the profiles I have been doing for these
> loads the biggest cost wasn't actually the per-cpu allocation itself but
> the cost of zeroing the allocated counter for many CPUs (and then the
> counter summarization on exit) and you're not going to get rid of that with
> just reshuffling per-cpu fields and adding slab allocator in front.

That's a good point ! So skipping the zeroing of per-cpu fields would
indeed justify special-casing the single-threaded case.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com