The cost of the pcpu memory allocation is non-negligible for systems
with many cpus, and it is quite visible when forking a new task, as
reported on a few occasions. In particular, Jan Kara reported that the
commit introducing per-cpu counters for rss_stat caused a 10%
regression of system time for gitsource on his system [1].

On that same occasion, Jan suggested we special-case the
single-threaded case: since we know there won't be frequent remote
updates of rss_stat for single-threaded applications, we could handle
it with a local counter for most updates, and an atomic counter for
the infrequent remote updates.

This patchset implements that idea. It exposes a dual-mode counter
that starts as a simple counter, cheap to initialize for
single-threaded tasks, which can be upgraded in flight to a
fully-fledged per-cpu counter later. Patch 3 then modifies the
rss_stat counters to use that structure, forcing the upgrade as soon
as a second task sharing the mm_struct is spawned. By delaying the
initialization cost until the MM is shared, we cover single-threaded
applications fairly cheaply, while not penalizing applications that
spawn multiple threads.

On a 256c system, where the pcpu allocation of the rss_stats is quite
noticeable, this reduced the wall-clock time of an artificial
fork-intensive microbenchmark (calling /bin/true in a loop) by 6% to
15%, depending on the number of cores. On a more realistic benchmark,
it showed a 1.5% improvement in kernbench elapsed time. More
performance data, including profiles, is available in the patch
modifying the rss_stat counters.

While this patchset exposes a single user of this API, it should be
useful in more cases, which is why I made it into a proper API. In
addition, considering the recent efforts in this area, such as
hierarchical per-cpu counters (which are orthogonal to this work
because they improve multi-threaded workloads), abstracting this
behind a new API could help merge both works.

Finally, this is an RFC because it is early work. In particular, I'd
be interested in more benchmark suggestions, and I'd like feedback on
whether this new interface should be implemented inside
percpu_counters as lazy counters or as a completely separate
interface.

Thanks,

[1] https://lore.kernel.org/all/20230608111408.s2minsenlcjow7q3@quack3

---
Cc: linux-kernel@vger.kernel.org
Cc: jack@suse.cz
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@gentwo.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>

Gabriel Krisman Bertazi (4):
  lib/percpu_counter: Split out a helper to insert into hotplug list
  lib: Support lazy initialization of per-cpu counters
  mm: Avoid percpu MM counters on single-threaded tasks
  mm: Split a slow path for updating mm counters

 arch/s390/mm/gmap_helpers.c         |   4 +-
 arch/s390/mm/pgtable.c              |   4 +-
 fs/exec.c                           |   2 +-
 include/linux/lazy_percpu_counter.h | 145 ++++++++++++++++++++++++++++
 include/linux/mm.h                  |  26 ++---
 include/linux/mm_types.h            |   4 +-
 include/linux/percpu_counter.h      |   5 +-
 include/trace/events/kmem.h         |   4 +-
 kernel/events/uprobes.c             |   2 +-
 kernel/fork.c                       |  14 ++-
 lib/percpu_counter.c                |  68 ++++++++++---
 mm/filemap.c                        |   2 +-
 mm/huge_memory.c                    |  22 ++---
 mm/khugepaged.c                     |   6 +-
 mm/ksm.c                            |   2 +-
 mm/madvise.c                        |   2 +-
 mm/memory.c                         |  20 ++--
 mm/migrate.c                        |   2 +-
 mm/migrate_device.c                 |   2 +-
 mm/rmap.c                           |  16 +--
 mm/swapfile.c                       |   6 +-
 mm/userfaultfd.c                    |   2 +-
 22 files changed, 276 insertions(+), 84 deletions(-)
 create mode 100644 include/linux/lazy_percpu_counter.h

-- 
2.51.0
On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> The cost of the pcpu memory allocation is non-negligible for systems
> with many cpus, and it is quite visible when forking a new task, as
> reported on a few occasions.

I've come to the same conclusion within the development of
the hierarchical per-cpu counters.

But while the mm_struct has a SLAB cache (initialized in
kernel/fork.c:mm_cache_init()), there is no such thing
for the per-mm per-cpu data.

In the mm_struct, we have the following per-cpu data (please
let me know if I missed any in the maze):

- struct mm_cid __percpu *pcpu_cid (or equivalent through
  struct mm_mm_cid after Thomas Gleixner gets his rewrite
  upstream),

- unsigned int __percpu *futex_ref,

- NR_MM_COUNTERS rss_stats per-cpu counters.

What would really reduce memory allocation overhead on fork
is to move all those fields into a top level
"struct mm_percpu_struct" as a first step. This would
merge 3 per-cpu allocations into one when forking a new
task.

Then the second step is to create a mm_percpu_struct
cache to bypass the per-cpu allocator.

I suspect that by doing just that we'd get most of the
performance benefits provided by the single-threaded special-case
proposed here.

I'm not against special-casing single-threaded if it's still worth it
after doing the underlying data structure layout/caching changes I'm
proposing here, but I think we need to fix the memory allocation
overhead issue first before working around it with special cases and
added complexity.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
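For illustration, a consolidated structure along those lines could look
roughly like this (hypothetical layout, and mm->pcpu is a hypothetical
field; types are guesses based on the existing per-mm per-cpu data):

	struct mm_percpu_struct {		/* hypothetical sketch */
		struct mm_cid cid;		/* scheduler concurrency-id state */
		unsigned int futex_ref;		/* futex private-hash reference */
		s32 rss_stat[NR_MM_COUNTERS];	/* per-cpu rss deltas */
	};

	/*
	 * One per-cpu allocation instead of three at fork time, e.g.:
	 *
	 *	mm->pcpu = alloc_percpu(struct mm_percpu_struct);
	 */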
On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> > The cost of the pcpu memory allocation is non-negligible for systems
> > with many cpus, and it is quite visible when forking a new task, as
> > reported on a few occasions.
>
> I've come to the same conclusion within the development of
> the hierarchical per-cpu counters.
>
> But while the mm_struct has a SLAB cache (initialized in
> kernel/fork.c:mm_cache_init()), there is no such thing
> for the per-mm per-cpu data.
>
> In the mm_struct, we have the following per-cpu data (please
> let me know if I missed any in the maze):
>
> - struct mm_cid __percpu *pcpu_cid (or equivalent through
>   struct mm_mm_cid after Thomas Gleixner gets his rewrite
>   upstream),
>
> - unsigned int __percpu *futex_ref,
>
> - NR_MM_COUNTERS rss_stats per-cpu counters.
>
> What would really reduce memory allocation overhead on fork
> is to move all those fields into a top level
> "struct mm_percpu_struct" as a first step. This would
> merge 3 per-cpu allocations into one when forking a new
> task.
>
> Then the second step is to create a mm_percpu_struct
> cache to bypass the per-cpu allocator.
>
> I suspect that by doing just that we'd get most of the
> performance benefits provided by the single-threaded special-case
> proposed here.

I don't think so. Because in the profiles I have been doing for these
loads the biggest cost wasn't actually the per-cpu allocation itself but
the cost of zeroing the allocated counter for many CPUs (and then the
counter summarization on exit) and you're not going to get rid of that with
just reshuffling per-cpu fields and adding slab allocator in front.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > What would really reduce memory allocation overhead on fork
> > is to move all those fields into a top level
> > "struct mm_percpu_struct" as a first step. This would
> > merge 3 per-cpu allocations into one when forking a new
> > task.
> >
> > Then the second step is to create a mm_percpu_struct
> > cache to bypass the per-cpu allocator.
> >
> > I suspect that by doing just that we'd get most of the
> > performance benefits provided by the single-threaded special-case
> > proposed here.
>
> I don't think so. Because in the profiles I have been doing for these
> loads the biggest cost wasn't actually the per-cpu allocation itself but
> the cost of zeroing the allocated counter for many CPUs (and then the
> counter summarization on exit) and you're not going to get rid of that with
> just reshuffling per-cpu fields and adding slab allocator in front.
>
The entire ordeal has been discussed several times already. I'm rather
disappointed there is a new patchset posted which does not address any
of it and goes straight to special-casing single-threaded operation.
The major claims (by me anyway) are:
1. single-threaded operation for fork + exec suffers avoidable
overhead even without the rss counter problem, which is tractable
with the same kind of thing which would sort out the multi-threaded
problem
2. unfortunately there is an increasing number of multi-threaded (and
often short-lived) processes (example: lld, the linker from the llvm
project; more broadly plenty of things in Rust where people think
threading == performance)
Bottom line is, solutions like the one proposed in the patchset are at
best a stopgap and even they leave performance on the table for the
case they are optimizing for.
The pragmatic way forward (as I see it anyway) is to fix up the
multi-threaded thing and see if trying to special case for
single-threaded case is justifiable afterwards.
Given that the current patchset has to resort to atomics in certain
cases, there is some error-proneness and runtime overhead associated
with it going beyond merely checking if the process is
single-threaded, which puts an additional question mark on it.
Now to business:
You mentioned the rss loops are a problem. I agree, but they can be
largely damage-controlled. More importantly there are 2 loops of the
sort already happening even with the patchset at hand.
mm_alloc_cid() results in one loop in the percpu allocator to zero out
the area, then mm_init_cid() performs the following:
	for_each_possible_cpu(i) {
		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);

		pcpu_cid->cid = MM_CID_UNSET;
		pcpu_cid->recent_cid = MM_CID_UNSET;
		pcpu_cid->time = 0;
	}
There is no way this is not visible already on 256 threads.
Preferably some magic would be done to init this on first use on a given
CPU. There is some bitmap tracking CPU presence, maybe this can be
tackled on top of it. But for the sake of argument let's say that's
too expensive or perhaps not feasible. Even then, the walk can be done
*once* by telling the percpu allocator to refrain from zeroing memory.
Which brings me to rss counters. In the current kernel that's
*another* loop over everything to zero it out. But it does not have to
be that way. Suppose bitmap shenanigans mentioned above are no-go for
these as well.
So instead the code could reach out to the percpu allocator to
allocate memory for both cid and rss (as mentioned by Mathieu), but
have it returned uninitialized and loop over it once sorting out both
cid and rss in the same body. This should be drastically faster than
the current code.
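Roughly, assuming a combined per-cpu structure for cid + rss and an
(assumed) way to get the per-cpu area back non-zeroed, the single pass
could look like this -- hypothetical names (mm->pcpu included), sketch
only:

	struct mm_percpu {			/* hypothetical combined layout */
		struct mm_cid cid;
		s32 rss_stat[NR_MM_COUNTERS];
	};

	for_each_possible_cpu(i) {
		struct mm_percpu *p = per_cpu_ptr(mm->pcpu, i);

		p->cid.cid = MM_CID_UNSET;
		p->cid.recent_cid = MM_CID_UNSET;
		p->cid.time = 0;
		memset(p->rss_stat, 0, sizeof(p->rss_stat));
	}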
But one may observe it is an invariant the values sum up to 0 on process exit.
So if one was to make sure the first time this is handed out by the
percpu allocator the values are all 0s and then cache the area
somewhere for future allocs/frees of mm, there would be no need to do
the zeroing on alloc.
On the free side summing up rss counters in check_mm() is only there
for debugging purposes. Suppose it is useful enough that it needs to
stay. Even then, as implemented right now, this is just slow for no
reason:
	for (i = 0; i < NR_MM_COUNTERS; i++) {
		long x = percpu_counter_sum(&mm->rss_stat[i]);
		[snip]
	}
That's *four* loops with extra overhead of irq-trips for every single
one. This can be patched up to only do one loop, possibly even with
irqs enabled the entire time.
Doing the loop is still slower than not doing it, but this may be just
fast enough to obsolete the ideas like in the proposed patchset.
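For illustration, a single-pass variant could look roughly like this
(sketch only: it peeks at percpu_counter internals and ignores the
counter locks, which is arguably tolerable for a debug-only check):

	long sums[NR_MM_COUNTERS] = { 0 };
	int cpu, i;

	for (i = 0; i < NR_MM_COUNTERS; i++)
		sums[i] = mm->rss_stat[i].count;	/* global part */

	for_each_possible_cpu(cpu)			/* one walk for all counters */
		for (i = 0; i < NR_MM_COUNTERS; i++)
			sums[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);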
While per-cpu level caching for all possible allocations seems like
the easiest way out, it in fact does *NOT* fully solve the problem -- you
are still going to globally serialize in lru_gen_add_mm() (and the del
part), pgd_alloc() and other places.
Or to put it differently, per-cpu caching of mm_struct itself makes no
sense in the current kernel (with the patchset or not) because on the
way to finish the alloc or free you are going to globally serialize
several times and *that* is the issue to fix in the long run. You can
make the problematic locks fine-grained (and consequently alleviate
the scalability aspect), but you are still going to suffer the
overhead of taking them.
As far as I'm concerned the real long term solution(tm) would make the
cached mm's retain the expensive to sort out state -- list presence,
percpu memory and whatever else.
To that end I see 2 feasible approaches:
1. a dedicated allocator with coarse granularity
Instead of per-cpu, you could have an instance for every n threads
(let's say 8 or whatever). This would pose a tradeoff between total
memory usage and scalability outside of a microbenchmark setting. You
are still going to serialize in some cases, but only once on alloc and
once on free, not several times and you are still cheaper
single-threaded. This is faster all around.
2. dtor support in the slub allocator
ctor does the hard work and dtor undoes it. There is an unfinished
patchset by Harry which implements the idea[1].
There is a serious concern about deadlock potential stemming from
running arbitrary dtor code during memory reclaim. I already described
elsewhere how with a little bit of discipline supported by lockdep
this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
take any locks if you hold them and you have to disable interrupts) +
mark dtors as only allowed to hold a leaf spinlock et voila, code
guaranteed to not deadlock). But then all code trying to cache its
state, to be undone with the dtor, has to be patched to facilitate it.
Again, bugs in the area get sorted out by lockdep.
The good news is that folks were apparently open to punting reclaim of
such memory into a workqueue, which completely alleviates that concern
anyway.
It so happens that if fork + exit is involved there are numerous other
bottlenecks which overshadow the above, but that's a rant for another
day. Here we can pretend for a minute they are solved.
[1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads
Mateusz Guzik <mjguzik@gmail.com> writes:

> On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
>> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
>> > What would really reduce memory allocation overhead on fork
>> > is to move all those fields into a top level
>> > "struct mm_percpu_struct" as a first step. This would
>> > merge 3 per-cpu allocations into one when forking a new
>> > task.
>> >
>> > Then the second step is to create a mm_percpu_struct
>> > cache to bypass the per-cpu allocator.
>> >
>> > I suspect that by doing just that we'd get most of the
>> > performance benefits provided by the single-threaded special-case
>> > proposed here.
>>
>> I don't think so. Because in the profiles I have been doing for these
>> loads the biggest cost wasn't actually the per-cpu allocation itself but
>> the cost of zeroing the allocated counter for many CPUs (and then the
>> counter summarization on exit) and you're not going to get rid of that with
>> just reshuffling per-cpu fields and adding slab allocator in front.
>>

Hi Mateusz,

> The major claims (by me anyway) are:
> 1. single-threaded operation for fork + exec suffers avoidable
> overhead even without the rss counter problem, which is tractable
> with the same kind of thing which would sort out the multi-threaded
> problem

Agreed, there are more issues in the fork/exec path than just the
rss_stat. The rss_stat performance is particularly relevant to us,
though, because it is a clear regression for the single-threaded case
introduced in 6.2.

I took the time to test the slab constructor approach with the
/sbin/true microbenchmark. I've seen only a 2% gain on that tight loop
on the 80c machine, which, granted, is an artificial benchmark, but
still a good stressor of the single-threaded case. With this patchset,
I reported a 6% improvement, getting it close to the performance before
the pcpu rss_stats introduction. This is expected, as avoiding the pcpu
allocation and initialization altogether for the single-threaded case,
where it is not necessary, will always be better than speeding up the
allocation (even though that is a worthwhile effort in itself, as
Mathieu pointed out).

> 2. unfortunately there is an increasing number of multi-threaded (and
> often short-lived) processes (example: lld, the linker from the llvm
> project; more broadly plenty of things in Rust where people think
> threading == performance)

I don't agree with this argument, though. Sure, there is an increasing
number of multi-threaded applications, but that is not relevant. The
relevant argument is the number of single-threaded workloads. One
example is coreutils, which are spawned to death by scripts. I did take
the care of testing the patchset with a full distro on my day-to-day
laptop, and I wasn't surprised to see that the vast majority of forked
tasks never fork a second thread. The ones that do are most often
long-lived applications, where the cost of mm initialization is way
less relevant to the overall system performance.

Another example is the fact that real-world benchmarks, like kernbench,
can be improved by special-casing single-threaded tasks.

> The pragmatic way forward (as I see it anyway) is to fix up the
> multi-threaded thing and see if trying to special case for
> single-threaded case is justifiable afterwards.
>
> Given that the current patchset has to resort to atomics in certain
> cases, there is some error-proneness and runtime overhead associated
> with it going beyond merely checking if the process is
> single-threaded, which puts an additional question mark on it.

I don't get why atomics would make it error-prone. But, regarding the
runtime overhead, please note that the main point of this approach is
that the hot path can be handled with a simple non-atomic variable
write in the task context, and not the atomic operation. The latter is
only used for the infrequent case where the counter is touched by an
external task, such as OOM, khugepaged, etc.

> Now to business:

-- 
Gabriel Krisman Bertazi
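As a rough sketch of that dual-mode update (hypothetical type and field
names, sketch only; synchronization with the in-flight upgrade is
omitted):

	struct lazy_counter_sketch {			/* hypothetical */
		long local;				/* owner-only updates */
		atomic_long_t remote;			/* rare remote updates */
		struct percpu_counter pcpu;		/* valid once upgraded */
		bool upgraded;
	};

	static inline void lazy_counter_add(struct lazy_counter_sketch *c,
					    long value, bool owner)
	{
		if (READ_ONCE(c->upgraded))		/* mm already shared */
			percpu_counter_add(&c->pcpu, value);
		else if (owner)
			c->local += value;		/* hot path: plain, non-atomic write */
		else
			atomic_long_add(value, &c->remote); /* OOM, khugepaged, ... */
	}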
On Mon, Dec 01, 2025 at 10:23:43AM -0500, Gabriel Krisman Bertazi wrote:
> Mateusz Guzik <mjguzik@gmail.com> writes:
>
> > On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> >> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> >> > What would really reduce memory allocation overhead on fork
> >> > is to move all those fields into a top level
> >> > "struct mm_percpu_struct" as a first step. This would
> >> > merge 3 per-cpu allocations into one when forking a new
> >> > task.
> >> >
> >> > Then the second step is to create a mm_percpu_struct
> >> > cache to bypass the per-cpu allocator.
> >> >
> >> > I suspect that by doing just that we'd get most of the
> >> > performance benefits provided by the single-threaded special-case
> >> > proposed here.
> >>
> >> I don't think so. Because in the profiles I have been doing for these
> >> loads the biggest cost wasn't actually the per-cpu allocation itself but
> >> the cost of zeroing the allocated counter for many CPUs (and then the
> >> counter summarization on exit) and you're not going to get rid of that with
> >> just reshuffling per-cpu fields and adding slab allocator in front.
> >>
>
> Hi Mateusz,
>
> > The major claims (by me anyway) are:
> > 1. single-threaded operation for fork + exec suffers avoidable
> > overhead even without the rss counter problem, which is tractable
> > with the same kind of thing which would sort out the multi-threaded
> > problem
>
> Agreed, there are more issues in the fork/exec path than just the
> rss_stat. The rss_stat performance is particularly relevant to us,
> though, because it is a clear regression for the single-threaded case
> introduced in 6.2.
>
> I took the time to test the slab constructor approach with the
> /sbin/true microbenchmark. I've seen only a 2% gain on that tight loop
> on the 80c machine, which, granted, is an artificial benchmark, but
> still a good stressor of the single-threaded case. With this patchset,
> I reported a 6% improvement, getting it close to the performance before
> the pcpu rss_stats introduction.

Hi Gabriel,

I don't want to argue which approach is better, but just wanted to
mention that maybe this is not a fair comparison, because we can
(almost) eliminate the initialization cost with the slab ctor & dtor
pair.

As Mateusz pointed out, under normal conditions we know that the sum of
each rss_stat counter is zero when it's freed. That is what the slab
constructor is for; if we know that certain fields of a type are freed
in a particular state, then we only need to initialize them once in the
constructor when the object is first created, and no initialization is
needed for subsequent allocations.

We couldn't use the slab constructor to do this because percpu memory
is not allocated when it's called, but with the ctor/dtor pair we can
do this.

> This is expected, as avoiding the pcpu
> allocation and initialization altogether for the single-threaded case,
> where it is not necessary, will always be better than speeding up the
> allocation (even though that is a worthwhile effort in itself, as
> Mathieu pointed out).

-- 
Cheers,
Harry / Hyeonggon
On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> On Fri, Nov 28, 2025 at 9:10 PM Jan Kara <jack@suse.cz> wrote:
> > On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
> > > What would really reduce memory allocation overhead on fork
> > > is to move all those fields into a top level
> > > "struct mm_percpu_struct" as a first step. This would
> > > merge 3 per-cpu allocations into one when forking a new
> > > task.
> > >
> > > Then the second step is to create a mm_percpu_struct
> > > cache to bypass the per-cpu allocator.
> > >
> > > I suspect that by doing just that we'd get most of the
> > > performance benefits provided by the single-threaded special-case
> > > proposed here.
> >
> > I don't think so. Because in the profiles I have been doing for these
> > loads the biggest cost wasn't actually the per-cpu allocation itself but
> > the cost of zeroing the allocated counter for many CPUs (and then the
> > counter summarization on exit) and you're not going to get rid of that with
> > just reshuffling per-cpu fields and adding slab allocator in front.
> >
>
> The entire ordeal has been discussed several times already. I'm rather
> disappointed there is a new patchset posted which does not address any
> of it and goes straight to special-casing single-threaded operation.
>
> The major claims (by me anyway) are:
> 1. single-threaded operation for fork + exec suffers avoidable
> overhead even without the rss counter problem, which is tractable
> with the same kind of thing which would sort out the multi-threaded
> problem
> 2. unfortunately there is an increasing number of multi-threaded (and
> often short-lived) processes (example: lld, the linker from the llvm
> project; more broadly plenty of things in Rust where people think
> threading == performance)
>
> Bottom line is, solutions like the one proposed in the patchset are at
> best a stopgap and even they leave performance on the table for the
> case they are optimizing for.
>
> The pragmatic way forward (as I see it anyway) is to fix up the
> multi-threaded thing and see if trying to special case for
> single-threaded case is justifiable afterwards.
>
> Given that the current patchset has to resort to atomics in certain
> cases, there is some error-proneness and runtime overhead associated
> with it going beyond merely checking if the process is
> single-threaded, which puts an additional question mark on it.
>
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
>
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
> 	for_each_possible_cpu(i) {
> 		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
>
> 		pcpu_cid->cid = MM_CID_UNSET;
> 		pcpu_cid->recent_cid = MM_CID_UNSET;
> 		pcpu_cid->time = 0;
> 	}
>
> There is no way this is not visible already on 256 threads.
>
> Preferably some magic would be done to init this on first use on a given
> CPU. There is some bitmap tracking CPU presence, maybe this can be
> tackled on top of it. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing memory.
>
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
>
> So instead the code could reach out to the percpu allocator to
> allocate memory for both cid and rss (as mentioned by Mathieu), but
> have it returned uninitialized and loop over it once sorting out both
> cid and rss in the same body. This should be drastically faster than
> the current code.
>
> But one may observe it is an invariant the values sum up to 0 on process exit.
>
> So if one was to make sure the first time this is handed out by the
> percpu allocator the values are all 0s and then cache the area
> somewhere for future allocs/frees of mm, there would be no need to do
> the zeroing on alloc.
That's what slab constructor is for!
> On the free side summing up rss counters in check_mm() is only there
> for debugging purposes. Suppose it is useful enough that it needs to
> stay. Even then, as implemented right now, this is just slow for no
> reason:
>
> 	for (i = 0; i < NR_MM_COUNTERS; i++) {
> 		long x = percpu_counter_sum(&mm->rss_stat[i]);
> 		[snip]
> 	}
>
> That's *four* loops with extra overhead of irq-trips for every single
> one. This can be patched up to only do one loop, possibly even with
> irqs enabled the entire time.
>
> Doing the loop is still slower than not doing it, but this may be just
> fast enough to obsolete the ideas like in the proposed patchset.
>
> While per-cpu level caching for all possible allocations seems like
> the easiest way out, it in fact does *NOT* fully solve the problem -- you
> are still going to globally serialize in lru_gen_add_mm() (and the del
> part), pgd_alloc() and other places.
>
> Or to put it differently, per-cpu caching of mm_struct itself makes no
> sense in the current kernel (with the patchset or not) because on the
> way to finish the alloc or free you are going to globally serialize
> several times and *that* is the issue to fix in the long run. You can
> make the problematic locks fine-grained (and consequently alleviate
> the scalability aspect), but you are still going to suffer the
> overhead of taking them.
>
> As far as I'm concerned the real long term solution(tm) would make the
> cached mm's retain the expensive to sort out state -- list presence,
> percpu memory and whatever else.
>
> To that end I see 2 feasible approaches:
> 1. a dedicated allocator with coarse granularity
>
> Instead of per-cpu, you could have an instance for every n threads
> (let's say 8 or whatever). This would pose a tradeoff between total
> memory usage and scalability outside of a microbenchmark setting. You
> are still going to serialize in some cases, but only once on alloc and
> once on free, not several times and you are still cheaper
> single-threaded. This is faster all around.
>
> 2. dtor support in the slub allocator
>
> ctor does the hard work and dtor undoes it. There is an unfinished
> patchset by Harry which implements the idea[1].
Apologies for not reposting it for a while. I have limited capacity to push
this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
branch after rebasing it onto the latest slab/for-next.
https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
My review of this version is limited, but I did a little bit of testing.
> There is a serious concern about deadlock potential stemming from
> running arbitrary dtor code during memory reclaim. I already described
> elsewhere how with a little bit of discipline supported by lockdep
> this is a non-issue (tl;dr add spinlocks marked as "leaf" (you can't
> take any locks if you hold them and you have to disable interrupts) +
> mark dtors as only allowed to hold a leaf spinlock et voila, code
> guaranteed to not deadlock). But then all code trying to cache its
> state in to be undone with dtor has to be patched to facilitate it.
> Again bugs in the area sorted out by lockdep.
>
> The good news is that folks were apparently open to punting reclaim of
> such memory into a workqueue, which completely alleviates that concern
> anyway.
I took the good news and switched to using workqueue to reclaim slabs
(for caches with dtor) in v2.
> It so happens that if fork + exit is involved there are numerous other
> bottlenecks which overshadow the above, but that's a rant for another
> day. Here we can pretend for a minute they are solved.
>
> [1] https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2-wip?ref_type=heads
--
Cheers,
Harry / Hyeonggon
On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> Apologies for not reposting it for a while. I have limited capacity to push
> this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
> branch after rebasing it onto the latest slab/for-next.
>
> https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
>

nice, thanks. This takes care of the majority of the needful(tm).

To reiterate, should something like this land, it is going to address
the multicore scalability concern for single-threaded processes better
than the patchset by Gabriel thanks to also taking care of cid. Bonus
points for handling creation and teardown of multi-threaded processes.

However, this is still going to suffer from doing a full cpu walk on
process exit. As I described earlier, the current handling can be
massively depessimized by reimplementing it to take care of all 4
counters in each iteration, instead of walking everything 4 times.
This is still going to be slower than not doing the walk at all, but
it may be fast enough that Gabriel's patchset is no longer justifiable.

But then the test box is "only" 256 hw threads, what about bigger boxes?

Given my previous note about increased use of multithreading in
userspace, the more concerned you happen to be about such a walk, the
more you want an actual solution which takes care of multithreaded
processes.

Additionally one has to assume per-cpu memory will be useful for other
facilities down the line, making such a walk into an even bigger
problem.

Thus ultimately *some* tracking of whether a given mm was ever active
on a given cpu is needed, preferably cheaply implemented at least for
the context switch code. Per what I described in another e-mail, one
way to do it would be to coalesce it with tlb handling by changing how
the bitmap tracking is handled -- having 2 adjacent bits denote cpu
usage + tlb separately. For the common case setting the two should
cost about the same. Iteration for tlb shootdowns would be less
efficient but that's probably tolerable. Maybe there is a better way,
I did not put much thought into it. I just claim sooner or later this
will need to get solved. At the same time it would be a bummer to add
stopgaps without even trying.

With the cpu tracking problem solved, check_mm would visit few cpus in
the benchmark (probably just 1) and it would be faster single-threaded
than the proposed patch *and* would retain that for processes which
went multithreaded.

I'm not signing up to handle this though and someone else would have to
sign off on the cpu tracking thing anyway. That is to say, I laid out
the lay of the land as I see it but I'm not doing any work. :)
On 2025-12-01 06:31, Mateusz Guzik wrote:
> On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>> Apologies for not reposting it for a while. I have limited capacity to push
>> this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
>> branch after rebasing it onto the latest slab/for-next.
>>
>> https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
>>
>
> nice, thanks. This takes care of the majority of the needful(tm).
>
> To reiterate, should something like this land, it is going to address
> the multicore scalability concern for single-threaded processes better
> than the patchset by Gabriel thanks to also taking care of cid. Bonus
> points for handling creation and teardown of multi-threaded processes.
>
> However, this is still going to suffer from doing a full cpu walk on
> process exit. As I described earlier, the current handling can be
> massively depessimized by reimplementing it to take care of all 4
> counters in each iteration, instead of walking everything 4 times.
> This is still going to be slower than not doing the walk at all, but
> it may be fast enough that Gabriel's patchset is no longer justifiable.
>
> But then the test box is "only" 256 hw threads, what about bigger boxes?
>
> Given my previous note about increased use of multithreading in
> userspace, the more concerned you happen to be about such a walk, the
> more you want an actual solution which takes care of multithreaded
> processes.
>
> Additionally one has to assume per-cpu memory will be useful for other
> facilities down the line, making such a walk into an even bigger
> problem.
>
> Thus ultimately *some* tracking of whether a given mm was ever active
> on a given cpu is needed, preferably cheaply implemented at least for
> the context switch code. Per what I described in another e-mail, one
> way to do it would be to coalesce it with tlb handling by changing how
> the bitmap tracking is handled -- having 2 adjacent bits denote cpu
> usage + tlb separately. For the common case setting the two should
> cost about the same. Iteration for tlb shootdowns would be less
> efficient but that's probably tolerable. Maybe there is a better way,
> I did not put much thought into it. I just claim sooner or later this
> will need to get solved. At the same time it would be a bummer to add
> stopgaps without even trying.
>
> With the cpu tracking problem solved, check_mm would visit few cpus in
> the benchmark (probably just 1) and it would be faster single-threaded
> than the proposed patch *and* would retain that for processes which
> went multithreaded.

Looking at this problem, it appears to be a good fit for rseq mm_cid
(per-mm concurrency ids). Let me explain.

I originally implemented the rseq mm_cid for userspace. It keeps track
of max_mm_cid = min(nr_threads, nr_allowed_cpus) for each mm, and lets
the scheduler select a current mm_cid value within the range
[0 .. max_mm_cid - 1].

With Thomas Gleixner's rewrite (currently in tip), we even have hooks
in thread clone/exit where we know when max_mm_cid is
increased/decreased for a mm. So we could keep track of the maximum
value of max_mm_cid over the lifetime of a mm.

So using mm_cid for the per-mm rss counters would involve:

- Still allocating memory per-cpu on mm allocation (nr_cpu_ids), but
  without zeroing all that memory (we eliminate a possible-cpus walk on
  allocation).

- Initializing CPU counters on thread clone when max_mm_cid is
  increased, and keeping track of the max value of max_mm_cid over the
  mm lifetime.

- Rather than using the per-cpu accessors to access the counters, we
  would have to load the per-task mm_cid field to get the counter
  index. This would have a slight added overhead on the fast path,
  because we would change a segment-selector-prefix operation for an
  access that depends on a load of the task struct's current mm_cid
  index.

- Iteration on all possible cpus at process exit is replaced by an
  iteration on the mm's maximum max_mm_cid, which will be bound by the
  maximum value of min(nr_threads, nr_allowed_cpus) over the mm
  lifetime. This iteration should be done with the new mm_cid mutex
  held across thread clone/exit.

One more downside to consider is loss of NUMA locality, because the
index used to access the per-cpu memory would not take into account
the hardware topology. The index-to-topology mapping should stay
stable for a given mm, but if we mix the memory allocation of per-cpu
data across different mms, then the NUMA locality would be degraded.
Ideally we'd have a per-cpu allocator with per-mm arenas for mm_cid
indexing if we care about NUMA locality.

So let's say you have a 256-core machine, where cpu numbers can go
from 0 to 255, with a 4-thread process: mm_cid will be limited to the
range [0..3]. Likewise if there are tons of threads in a process
limited to a few cores (e.g. pinned on cores from 10 to 19), the range
will be limited to [0..9].

This approach solves the runtime overhead issue of zeroing per-cpu
memory for all scenarios:

* single-threaded: index = 0
* nr_threads < nr_cpu_ids:
  * nr_threads < nr_allowed_cpus:  index = [0 .. nr_threads - 1]
  * nr_threads >= nr_allowed_cpus: index = [0 .. nr_allowed_cpus - 1]
* nr_threads >= nr_cpu_ids: index = [0 .. nr_cpu_ids - 1]

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
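As a rough sketch of what the indexing change could look like
(hypothetical fields such as rss_by_cid; this glosses over how updates
synchronize with cid reassignment and with remote updaters):

	/* Hypothetical: rss deltas stored in an array indexed by mm_cid
	 * rather than by cpu, sized by the max max_mm_cid seen for this mm. */
	static inline void mm_counter_add(struct mm_struct *mm, int member, long value)
	{
		int cid = READ_ONCE(current->mm_cid);	/* in [0 .. max_mm_cid - 1] */

		/* A cid is owned by the running task, so a plain update of its
		 * slot is assumed to be good enough for this illustration. */
		mm->rss_by_cid[cid].count[member] += value;
	}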
On Sat, Nov 29, 2025 at 06:57:21AM +0100, Mateusz Guzik wrote:
> Now to business:
> You mentioned the rss loops are a problem. I agree, but they can be
> largely damage-controlled. More importantly there are 2 loops of the
> sort already happening even with the patchset at hand.
>
> mm_alloc_cid() results in one loop in the percpu allocator to zero out
> the area, then mm_init_cid() performs the following:
> 	for_each_possible_cpu(i) {
> 		struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
>
> 		pcpu_cid->cid = MM_CID_UNSET;
> 		pcpu_cid->recent_cid = MM_CID_UNSET;
> 		pcpu_cid->time = 0;
> 	}
>
> There is no way this is not visible already on 256 threads.
>
> Preferably some magic would be done to init this on first use on a given
> CPU. There is some bitmap tracking CPU presence, maybe this can be
> tackled on top of it. But for the sake of argument let's say that's
> too expensive or perhaps not feasible. Even then, the walk can be done
> *once* by telling the percpu allocator to refrain from zeroing memory.
>
> Which brings me to rss counters. In the current kernel that's
> *another* loop over everything to zero it out. But it does not have to
> be that way. Suppose bitmap shenanigans mentioned above are no-go for
> these as well.
>
So I had another look and I think bitmapping it is perfectly feasible,
albeit requiring a little bit of refactoring to avoid adding overhead in
the common case.
There is a bitmap for tlb tracking, updated like so on context switch in
switch_mm_irqs_off():
	if (next != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))
		cpumask_set_cpu(cpu, mm_cpumask(next));
... and of course cleared at times.
The easiest way out would be to add an additional bitmap with bits which
are *never* cleared. But that's another cache miss, preferably avoided.
Instead the entire thing could be reimplemented to have 2 bits per CPU
in the bitmap -- one for tlb and another for ever running on it.
Having spotted you are running on the given cpu for the first time, the
rss area gets zeroed out and *both* bits get set et voila. The common
case gets away with the same load as always. The less common case gets
more work of having to zero the counters and initialize the cid.
In return both cid and rss handling can avoid mandatory linear walks by
cpu count, instead merely having to visit the cpus known to have used a
given mm.
I don't think this is particularly ugly or complicated, just needs some
care & time to sit through and refactor away all the direct access into
helpers.
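A minimal sketch of the idea, assuming a doubled-up mask with two bits
per CPU (the accessor and helper names below are made up for
illustration):

	/* Hypothetical layout: bit 2*cpu     = TLB-tracking bit (as today),
	 *                      bit 2*cpu + 1 = "this mm ever ran here".     */
	static inline void mm_mark_cpu(struct mm_struct *mm, int cpu)
	{
		unsigned long *bits = mm_twobit_mask(mm);	/* hypothetical accessor */

		if (likely(test_bit(2 * cpu + 1, bits)))
			return;				/* common case: one test, no extra store */

		/* First time this mm runs on this cpu: pay the init cost here. */
		mm_init_cid_cpu(mm, cpu);		/* hypothetical per-cpu cid init */
		mm_zero_rss_cpu(mm, cpu);		/* hypothetical per-cpu rss zeroing */
		set_bit(2 * cpu, bits);			/* TLB-tracking bit */
		set_bit(2 * cpu + 1, bits);		/* "ever ran here", never cleared */
	}

	/* Exit-time walks then only visit CPUs with the "ever ran here" bit set. */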
So if I was tasked with working on the overall problem, I would
definitely try to get this done. Fortunately for me this is not the
case. :-)
On 2025-11-28 15:10, Jan Kara wrote:
> On Fri 28-11-25 08:30:08, Mathieu Desnoyers wrote:
[...]
>> I suspect that by doing just that we'd get most of the
>> performance benefits provided by the single-threaded special-case
>> proposed here.
>
> I don't think so. Because in the profiles I have been doing for these
> loads the biggest cost wasn't actually the per-cpu allocation itself but
> the cost of zeroing the allocated counter for many CPUs (and then the
> counter summarization on exit) and you're not going to get rid of that with
> just reshuffling per-cpu fields and adding slab allocator in front.

That's a good point!

So skipping the zeroing of per-cpu fields would indeed justify
special-casing the single-threaded case.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com