From: Hao Li <haolee.swjtu@gmail.com>
When __pcs_replace_empty_main() fails to obtain a full sheaf directly
from the barn, it may either:
- Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
- Allocate a brand new full sheaf via alloc_full_sheaf().
After reacquiring the per-CPU lock, if pcs->main is still empty and
pcs->spare is NULL, the current code donates the empty main sheaf to
the barn via barn_put_empty_sheaf() and installs the full sheaf as
pcs->main, leaving pcs->spare unpopulated.
Instead, keep the existing empty main sheaf locally as the spare:
pcs->spare = pcs->main;
pcs->main = full;
This populates pcs->spare earlier, which can reduce future barn traffic.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
---
The Gmail account (haoli.tcs) I used to send v1 of the patch has been
restricted from sending emails for unknown reasons, so I'm sending v2
from this address instead. Thanks.
mm/slub.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/slub.c b/mm/slub.c
index a0b905c2a557..a3e73ebb0cc8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
*/
if (pcs->main->size == 0) {
+ if (!pcs->spare) {
+ pcs->spare = pcs->main;
+ pcs->main = full;
+ return pcs;
+ }
barn_put_empty_sheaf(barn, pcs->main);
pcs->main = full;
return pcs;
--
2.50.1
On 12/10/25 01:26, Hao Li wrote:
> From: Hao Li <haolee.swjtu@gmail.com>
>
> When __pcs_replace_empty_main() fails to obtain a full sheaf directly
> from the barn, it may either:
>
> - Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
> - Allocate a brand new full sheaf via alloc_full_sheaf().
>
> After reacquiring the per-CPU lock, if pcs->main is still empty and
> pcs->spare is NULL, the current code donates the empty main sheaf to
> the barn via barn_put_empty_sheaf() and installs the full sheaf as
> pcs->main, leaving pcs->spare unpopulated.
>
> Instead, keep the existing empty main sheaf locally as the spare:
>
> pcs->spare = pcs->main;
> pcs->main = full;
>
> This populates pcs->spare earlier, which can reduce future barn traffic.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
> ---
>
> The Gmail account(haoli.tcs) I used to send v1 of the patch has been
> restricted from sending emails for unknown reasons, so I'm sending v2
> from this address instead. Thanks.
>
> mm/slub.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index a0b905c2a557..a3e73ebb0cc8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + pcs->main = full;
> + return pcs;
> + }
> barn_put_empty_sheaf(barn, pcs->main);
> pcs->main = full;
> return pcs;
Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
adjusted like this:
diff --git a/mm/slub.c b/mm/slub.c
index f21b2f0c6f5a..ad71f01571f0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
*/
if (pcs->main->size == 0) {
- barn_put_empty_sheaf(barn, pcs->main);
+ if (!pcs->spare) {
+ pcs->spare = pcs->main;
+ } else {
+ barn_put_empty_sheaf(barn, pcs->main);
+ }
pcs->main = full;
return pcs;
}
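For reference, with this adjustment the resulting branch reads roughly as
follows (simply the hunk above with its diff markers resolved; indentation
approximate, not copied verbatim from the tree):

	if (pcs->main->size == 0) {
		if (!pcs->spare) {
			/* keep the empty main sheaf locally as the spare */
			pcs->spare = pcs->main;
		} else {
			/* spare already populated; donate the empty sheaf to the barn */
			barn_put_empty_sheaf(barn, pcs->main);
		}
		pcs->main = full;
		return pcs;
	}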
Hi Babka & Hao,
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
> pcs->main = full;
> return pcs;
> }
I noticed the previous lkp regression report and tested this fix:
* will-it-scale.per_process_ops
Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
results:
nr_tasks Delta
1 + 3.593%
8 + 3.094%
64 +60.247%
128 +49.344%
192 +27.500%
256 -12.077%
For the cases (nr_tasks: 1-192), there are improvements. I think
this is expected since the pre-cached spare sheaf reduces spinlock
contention: fewer barn_put_empty_sheaf() & barn_get_empty_sheaf() calls.
So (maybe too late),
Tested-by: Zhao Liu <zhao1.liu@intel.com>
But I found two more questions that might need consideration:
# Question 1: Regression for 256 tasks
For the above test - the case with nr_tasks: 256, there's a "slight"
regression. I did more testing:
(This is a single-round test; the 256-tasks data has jitter.)
nr_tasks Delta
244 0.308%
248 - 0.805%
252 12.070%
256 -11.441%
258 2.070%
260 1.252%
264 2.369%
268 -11.479%
272 2.130%
292 8.714%
296 10.905%
298 17.196%
300 11.783%
302 6.620%
304 3.112%
308 - 5.924%
It can be seen that most cases show improvement, though a few may
experience slight regression.
Based on the configuration of my machine:
GNR - 2 sockets with the following NUMA topology:
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-42,172-214
NUMA node1 CPU(s): 43-85,215-257
NUMA node2 CPU(s): 86-128,258-300
NUMA node3 CPU(s): 129-171,301-343
Since I set the CPU affinity on the cores, the 256-task case roughly
corresponds to the moment when Node 0 and Node 1 are filled.
The following is the perf data comparing 2 tests w/o fix & with this fix:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ....................... ....................................
#
61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
0.93% -0.32% [kernel.vmlinux] [k] __slab_free
0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
0.26% -0.07% [kernel.vmlinux] [k] down_write
0.53% -0.06% libc.so.6 [.] __mmap
0.66% -0.06% [kernel.vmlinux] [k] mas_walk
0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
0.45% -0.06% [kernel.vmlinux] [k] mas_find
0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
0.41% -0.05% [kernel.vmlinux] [k] memcpy
0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
0.14% +0.04% [kernel.vmlinux] [k] __put_partials
0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
0.49% -0.04% libc.so.6 [.] __munmap
0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
0.27% -0.03% [kernel.vmlinux] [k] up_write
0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
I think the interesting item is "get_partial_node". It seems this fix
makes get_partial_node slightly more frequent. However, I still
can't figure out why this is happening. Do you have any thoughts on it?
# Question 2: sheaf capacity
Back to the original commit which triggered the lkp regression. I did more
testing to check whether this fix could fully close the regression gap.
The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
use percpu sheaves for maple_node_cache") has the regression.
I compared v6.19-rc4 (f0b9d8eb98df) w/o fix & with fix against the
baseline:
nr_tasks w/o fix with fix
1 - 3.643% - 0.181%
8 -12.523% - 9.816%
64 -50.378% -20.482%
128 -36.736% - 5.518%
192 -22.963% - 1.777%
256 -32.926% - 41.026%
It appears that under extreme conditions, the regression remains significant.
I remembered your suggestion about larger capacity and did the following
testing:
 nr_tasks   59faa4da7cd4   59faa4da7cd4      59faa4da7cd4    59faa4da7cd4     59faa4da7cd4
                           (with this fix)   (cap: 32->64)   (cap: 32->128)   (cap: 32->256)
        1       -8.789%        -8.805%           -8.185%         -9.912%          -8.673%
        8      -12.256%        -9.219%          -10.460%        -10.070%          -8.819%
       64      -38.915%        -8.172%           -4.700%          4.571%           8.793%
      128       -8.032%        11.377%           23.232%         26.940%          30.573%
      192       -1.220%         9.758%           20.573%         22.645%          25.768%
      256       -6.570%         9.967%           21.663%         30.103%          33.876%
Compared with the baseline (3accabda4), larger capacity could
significantly improve the sheaves' scalability.
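For concreteness, the "cap: 32->N" experiments above presumably amount to
requesting a larger sheaf capacity when the cache is created. A hedged
sketch only - the sheaf_capacity field comes from the percpu sheaves
series, while the cache name, size, flags and value below are illustrative
assumptions rather than the actual call site:

	struct kmem_cache_args args = {
		/* the tests above compared 32 (current value) with 64, 128 and 256 */
		.sheaf_capacity = 128,
	};
	struct kmem_cache *cache;

	cache = kmem_cache_create("vm_area_struct", sizeof(struct vm_area_struct),
				  &args, SLAB_PANIC|SLAB_ACCOUNT);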
So, I'd like to know if you think dynamically or adaptively adjusting
capacity is a worthwhile idea.
Thanks for your patience.
Regards,
Zhao
On Thu, Jan 15, 2026 at 06:12:44PM +0800, Zhao Liu wrote:
> Hi Babka & Hao,
>
> > Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> > adjusted like this:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f21b2f0c6f5a..ad71f01571f0 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > - barn_put_empty_sheaf(barn, pcs->main);
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + } else {
> > + barn_put_empty_sheaf(barn, pcs->main);
> > + }
> > pcs->main = full;
> > return pcs;
> > }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
> results:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases (nr_tasks: 1-192), there're the improvements. I think
> this is expected since pre-cached spare sheaf reduces spinlock race:
> reduce barn_put_empty_sheaf() & barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
>
>
>
> But I find there are two more questions that might need consideration?
>
> # Question 1: Regression for 256 tasks
>
> For the above test - the case with nr_tasks: 256, there's a "slight"
> regression. I did more testing:
>
> (This is a single-round test; the 256-tasks data has jitter.)
>
> nr_tasks Delta
> 244 0.308%
> 248 - 0.805%
> 252 12.070%
> 256 -11.441%
> 258 2.070%
> 260 1.252%
> 264 2.369%
> 268 -11.479%
> 272 2.130%
> 292 8.714%
> 296 10.905%
> 298 17.196%
> 300 11.783%
> 302 6.620%
> 304 3.112%
> 308 - 5.924%
>
> It can be seen that most cases show improvement, though a few may
> experience slight regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I set the CPU affinity on the core, 256 cases is roughly
> equivalent to the moment when Node 0 and Node 1 are filled.
>
> The following is the perf data comparing 2 tests w/o fix & with this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the insteresting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. HMM, however, I still
> can't figure out why this is happening. Do you have any thoughts on it?
Hello, Zhao,
I tested the performance degradation issue we discussed concerning nr_tasks=256.
However, my results differ from yours, so I'd like to share my setup and
findings for clarity and comparison:
1. Machine Configuration
The topology of my machine is as follows:
CPU(s): 384
On-line CPU(s) list: 0-383
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Since my machine only has 192 cores when counting physical cores, I had to
enable SMT to support the higher number of tasks in the LKP test cases. My
configuration was as follows:
will-it-scale:
mode: process
test: mmap2
no_affinity: 0
smt: 1
The sequence of test cases I used was: nr_tasks= 1, 8, 64, 128, 192, 256, 384.
I noticed that your test command did not enable SMT, but I believe this
difference should not significantly affect the results. I wanted to highlight
this to ensure we account for any potential impact these differences might have
on our results.
2. Kernel Configuration
I conducted tests using the commit f0b9d8eb98dfee8d00419aa07543bdc2c1a44fb1
first, then applied the patch and tested again.
Each test was run 10 times, and I took the average results.
3. Test Results (Without Patch vs. With Patch)
will-it-scale.1.processes -1.27%
will-it-scale.8.processes +0.19%
will-it-scale.64.processes +25.81%
will-it-scale.128.processes +112.88%
will-it-scale.192.processes +157.42%
will-it-scale.256.processes +70.63%
will-it-scale.384.processes +132.12%
will-it-scale.per_process_ops +27.21%
will-it-scale.scalability +135.10%
will-it-scale.time.involuntary_context_switches +127.54%
will-it-scale.time.voluntary_context_switches +0.01%
will-it-scale.workload +94.47%
From the above results, it appears that the patch improved performance across
the board.
4. Further Analysis
I conducted additional tests by running "./mmap2_processes -t 384 -s 25 -m" both
without and with the patch, and sampled the results using perf.
Here's the "perf report --no-children -g" output without the patch:
```
- 65.72% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
- 55.33% testcase
- 55.33% __mmap
- 55.32% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 55.30% ksys_mmap_pgoff
- 55.30% vm_mmap_pgoff
- 55.28% do_mmap
- 55.24% __mmap_region
- 44.35% mas_preallocate
- 44.34% mas_alloc_nodes
- 44.34% kmem_cache_alloc_noprof
- 44.33% __pcs_replace_empty_main
+ 21.23% barn_put_empty_sheaf
+ 15.95% barn_get_empty_sheaf
+ 5.50% barn_replace_empty_sheaf
+ 1.33% _raw_spin_unlock_irqrestore
+ 10.24% mas_store_prealloc
+ 0.56% perf_event_mmap
- 10.38% __munmap
- 10.38% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 10.36% __x64_sys_munmap
- 10.36% __vm_munmap
- 10.36% do_vmi_munmap
- 10.35% do_vmi_align_munmap
- 10.14% mas_store_gfp
- 10.13% mas_wr_node_store
- 10.09% kvfree_call_rcu
- 10.09% __kfree_rcu_sheaf
- 10.08% barn_get_empty_sheaf
+ 9.17% _raw_spin_lock_irqsave
+ 0.90% _raw_spin_unlock_irqrestore
```
Here's the "perf report --no-children -g" output with the patch:
```
+ 30.36% mmap2_processes [kernel.kallsyms] [k] perf_iterate_ctx
- 28.80% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
- 24.72% testcase
- 24.71% __mmap
- 24.68% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 24.61% ksys_mmap_pgoff
- 24.57% vm_mmap_pgoff
- 24.51% do_mmap
- 24.30% __mmap_region
- 18.33% mas_preallocate
- 18.30% mas_alloc_nodes
- 18.30% kmem_cache_alloc_noprof
- 18.28% __pcs_replace_empty_main
+ 9.06% barn_replace_empty_sheaf
+ 6.12% barn_get_empty_sheaf
+ 3.09% refill_sheaf
+ 2.94% mas_store_prealloc
+ 2.64% perf_event_mmap
- 4.07% __munmap
- 4.04% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 3.98% __x64_sys_munmap
- 3.98% __vm_munmap
- 3.95% do_vmi_munmap
- 3.91% do_vmi_align_munmap
- 2.98% mas_store_gfp
- 2.90% mas_wr_node_store
- 2.75% kvfree_call_rcu
- 2.73% __kfree_rcu_sheaf
- 2.71% barn_get_empty_sheaf
+ 1.68% _raw_spin_lock_irqsave
+ 1.03% _raw_spin_unlock_irqrestore
- 0.76% vms_complete_munmap_vmas
0.67% vms_clear_ptes.part.41
```
Using perf diff, I compared the results before and after applying the patch:
```
# Event 'cycles:P'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... .................... ..................................................
#
65.72% -36.92% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
14.65% +15.70% [kernel.kallsyms] [k] perf_iterate_ctx
2.10% +2.45% [kernel.kallsyms] [k] unmap_page_range
1.09% +1.26% [kernel.kallsyms] [k] mas_wr_node_store
1.01% +1.14% [kernel.kallsyms] [k] free_pgd_range
0.84% +0.92% [kernel.kallsyms] [k] __mmap_region
0.50% +0.76% [kernel.kallsyms] [k] memcpy
0.62% +0.63% [kernel.kallsyms] [k] __cond_resched
0.49% +0.51% [kernel.kallsyms] [k] mas_walk
0.39% +0.42% [kernel.kallsyms] [k] mas_empty_area_rev
0.32% +0.40% [kernel.kallsyms] [k] mas_next_slot
0.34% +0.39% [kernel.kallsyms] [k] refill_sheaf
0.26% +0.36% [kernel.kallsyms] [k] mas_prev_slot
0.24% +0.29% [kernel.kallsyms] [k] do_syscall_64
0.25% +0.28% [kernel.kallsyms] [k] mas_find
0.20% +0.28% [kernel.kallsyms] [k] kmem_cache_alloc_noprof
0.24% +0.27% [kernel.kallsyms] [k] strlen
0.26% +0.27% [kernel.kallsyms] [k] perf_event_mmap
0.25% +0.26% [kernel.kallsyms] [k] do_mmap
0.22% +0.25% [kernel.kallsyms] [k] mas_store_gfp
0.25% +0.24% [kernel.kallsyms] [k] mas_leaf_max_gap
```
I also sampled the execution counts of several key functions using bpftrace.
Without Patch:
```
@cnt[barn_put_empty_sheaf]: 38833037
@cnt[barn_replace_empty_sheaf]: 41883891
@cnt[__pcs_replace_empty_main]: 41884885
@cnt[barn_get_empty_sheaf]: 75422518
@cnt[mmap]: 489634255
```
With Patch:
```
@cnt[barn_put_empty_sheaf]: 2382910
@cnt[barn_replace_empty_sheaf]: 90681637
@cnt[__pcs_replace_empty_main]: 90683656
@cnt[barn_get_empty_sheaf]: 82710919
@cnt[mmap]: 1113853385
```
From the above results, I found that the execution count of the
barn_put_empty_sheaf function dropped by an order of magnitude after applying
the patch. This is likely due to the patch's effect: when pcs->spare is NULL,
the empty sheaf is cached in pcs->spare instead of calling barn_put_empty_sheaf.
This reduces contention on the barn spinlock significantly.
At the same time, I noticed that the execution counts for
barn_replace_empty_sheaf and __pcs_replace_empty_main increased, but their
proportion in the perf sampling decreased. This suggests that the average
execution time for these functions has decreased.
Moreover, the total number of mmap executions after applying the patch
(1113853385) is more than double that of the unpatched kernel (489634255). This
further supports our analysis: since the test case duration is fixed at 25
seconds, the patched kernel runs faster, resulting in more iterations of the
test case and more mmap executions, which in turn increases the frequency of
these functions being called.
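Normalizing by the mmap count makes the same point per operation (just
dividing the bpftrace counts above, rounded):

  barn_put_empty_sheaf per mmap:  38833037 / 489634255  ~= 0.079  (w/o patch)
                                   2382910 / 1113853385 ~= 0.002  (with patch)
  barn_get_empty_sheaf per mmap:  75422518 / 489634255  ~= 0.154  (w/o patch)
                                  82710919 / 1113853385 ~= 0.074  (with patch)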
Based on my tests, everything appears reasonable and explainable. However, I
couldn't reproduce the performance drop for nr_tasks=256, and it's unclear why
our results differ. I'd appreciate it if you could share any additional insights
or thoughts on what might be causing this discrepancy. If needed, we could also
consult Vlastimil for further suggestions to better understand the issue or
explore other potential factors.
Thanks!
--
Thanks,
Hao
>
> # Question 2: sheaf capacity
>
> Back the original commit which triggerred lkp regression. I did more
> testing to check if this fix could totally fill the regression gap.
>
> The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") has the regression.
>
> I compared v6.19-rc4(f0b9d8eb98df) w/o fix & with fix aginst the base
> line:
>
> nr_tasks w/o fix with fix
> 1 - 3.643% - 0.181%
> 8 -12.523% - 9.816%
> 64 -50.378% -20.482%
> 128 -36.736% - 5.518%
> 192 -22.963% - 1.777%
> 256 -32.926% - 41.026%
>
> It appears that under extreme conditions, regression remains significate.
> I remembered your suggestion about larger capacity and did the following
> testing:
>
> 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4
> (with this fix) (cap: 32->64) (cap: 32->128) (cap: 32->256)
> 1 -8.789% -8.805% -8.185% -9.912% -8.673%
> 8 -12.256% -9.219% -10.460% -10.070% -8.819%
> 64 -38.915% -8.172% -4.700% 4.571% 8.793%
> 128 -8.032% 11.377% 23.232% 26.940% 30.573%
> 192 -1.220% 9.758% 20.573% 22.645% 25.768%
> 256 -6.570% 9.967% 21.663% 30.103% 33.876%
>
> Comparing with base line (3accabda4), larger capacity could
> significatly improve the Sheaf's scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
>
> Thanks for your patience.
>
> Regards,
> Zhao
>
>
> 1. Machine Configuration
>
> The topology of my machine is as follows:
>
> CPU(s): 384
> On-line CPU(s) list: 0-383
> Thread(s) per core: 2
> Core(s) per socket: 96
> Socket(s): 2
> NUMA node(s): 2

It seems like this is a GNR machine - maybe SNC could be enabled.

> Since my machine only has 192 cores when counting physical cores, I had to
> enable SMT to support the higher number of tasks in the LKP test cases. My
> configuration was as follows:
>
> will-it-scale:
> mode: process
> test: mmap2
> no_affinity: 0
> smt: 1

For lkp, the smt parameter is disabled. I tried with smt=1 locally, and the
difference between "with fix" & "w/o fix" is not significant. Maybe the smt
parameter could be set to 0.

On another machine (2 sockets with SNC3 enabled - 6 NUMA nodes), there's
a similar regression happening when tasks fill up a socket and then
there are more get_partial_node() calls.

> Here's the "perf report --no-children -g" output with the patch:
>
> ```
> + 30.36% mmap2_processes [kernel.kallsyms] [k] perf_iterate_ctx
> - 28.80% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> - 24.72% testcase
> - 24.71% __mmap
> - 24.68% entry_SYSCALL_64_after_hwframe
> - do_syscall_64
> - 24.61% ksys_mmap_pgoff
> - 24.57% vm_mmap_pgoff
> - 24.51% do_mmap
> - 24.30% __mmap_region
> - 18.33% mas_preallocate
> - 18.30% mas_alloc_nodes
> - 18.30% kmem_cache_alloc_noprof
> - 18.28% __pcs_replace_empty_main
> + 9.06% barn_replace_empty_sheaf
> + 6.12% barn_get_empty_sheaf
> + 3.09% refill_sheaf

This is the difference from my previous perf report: here the proportion
of refill_sheaf is low - it indicates the sheaves are sufficient most of
the time.

Back to my previous test, I'm guessing that with this fix, under extreme
conditions of massive mmap usage, each CPU now stores an empty spare sheaf
locally. Previously, each CPU's spare sheaf was NULL. So memory pressure
increases with more spare sheaves locally. And in that extreme scenario,
cross-socket remote NUMA access incurs significant overhead — which is why
the regression occurs here.

However, testing from 1 task to max tasks (nr_tasks = nr_logical_cpus)
shows overall significant improvements in most scenarios. Regressions
only occur at the specific topology boundaries described above.

I believe the cases with performance gains are more common. So I think
the regression is a corner case. If it does indeed impact certain
workloads in the future, we may need to reconsider optimization at that
time. It can now be used as a reference.

Thanks,
Zhao
On Tue, Jan 20, 2026 at 04:21:16PM +0800, Zhao Liu wrote:

Hi, Zhao,

Thanks again for your thorough testing and detailed feedback - I really
appreciate your help.

> > 1. Machine Configuration
> >
> > The topology of my machine is as follows:
> >
> > CPU(s): 384
> > On-line CPU(s) list: 0-383
> > Thread(s) per core: 2
> > Core(s) per socket: 96
> > Socket(s): 2
> > NUMA node(s): 2
>
> It seems like this is a GNR machine - maybe SNC could be enabled.

Actually, my cpu is AMD EPYC 96-Core Processor. SNC is disabled, and
there's only one NUMA node per socket.

> > Since my machine only has 192 cores when counting physical cores, I had to
> > enable SMT to support the higher number of tasks in the LKP test cases. My
> > configuration was as follows:
> >
> > will-it-scale:
> > mode: process
> > test: mmap2
> > no_affinity: 0
> > smt: 1
>
> For lkp, the smt parameter is disabled. I tried with smt=1 locally, and the
> difference between "with fix" & "w/o fix" is not significant. Maybe the smt
> parameter could be set to 0.

Just to confirm: do you mean that on your machine, when smt=1, the performance
difference between "with fix" and "without fix" is not significant - regardless
of whether it's a gain or regression? Thanks.

> On another machine (2 sockets with SNC3 enabled - 6 NUMA nodes), there's
> a similar regression happening when tasks fill up a socket and then
> there are more get_partial_node() calls.

From a theoretical standpoint, it seems like having more nodes should reduce
lock contention, not increase it...

By the way, I wanted to confirm one thing: in your earlier perf data, I noticed
that the sampling ratio of native_queued_spin_lock_slowpath and get_partial_node
slightly increased with the patch. Does this suggest that the lock contention
you're observing mainly comes from kmem_cache_node->list_lock rather than
node_barn->lock? If possible, could you help confirm this using "perf report -g"
to see where the contention is coming from?

> > Here's the "perf report --no-children -g" output with the patch:
> >
> > ```
> > + 30.36% mmap2_processes [kernel.kallsyms] [k] perf_iterate_ctx
> > - 28.80% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> > - 24.72% testcase
> > - 24.71% __mmap
> > - 24.68% entry_SYSCALL_64_after_hwframe
> > - do_syscall_64
> > - 24.61% ksys_mmap_pgoff
> > - 24.57% vm_mmap_pgoff
> > - 24.51% do_mmap
> > - 24.30% __mmap_region
> > - 18.33% mas_preallocate
> > - 18.30% mas_alloc_nodes
> > - 18.30% kmem_cache_alloc_noprof
> > - 18.28% __pcs_replace_empty_main
> > + 9.06% barn_replace_empty_sheaf
> > + 6.12% barn_get_empty_sheaf
> > + 3.09% refill_sheaf
>
> This is the difference from my previous perf report: here the proportion
> of refill_sheaf is low - it indicates the sheaves are sufficient most of
> the time.
>
> Back to my previous test, I'm guessing that with this fix, under extreme
> conditions of massive mmap usage, each CPU now stores an empty spare sheaf
> locally. Previously, each CPU's spare sheaf was NULL. So memory pressure
> increases with more spare sheaves locally.

I'm not quite sure about this point - my intuition is that this shouldn't
consume a significant amount of memory.

> And in that extreme scenario,
> cross-socket remote NUMA access incurs significant overhead — which is why
> the regression occurs here.

This part I haven't fully figured out yet - still looking into it.

> However, testing from 1 task to max tasks (nr_tasks = nr_logical_cpus)
> shows overall significant improvements in most scenarios. Regressions
> only occur at the specific topology boundaries described above.

It does look like there's some underlying factor at play, triggering a
performance tipping point. Though I haven't yet figured out the exact pattern.

> I believe the cases with performance gains are more common. So I think
> the regression is a corner case. If it does indeed impact certain
> workloads in the future, we may need to reconsider optimization at that
> time. It can now be used as a reference.

Agreed — this seems to be a corner case, and your test results have been really
helpful as a reference. Thanks again for the great support and insightful
discussion.

--
Thanks,
Hao
> Thanks again for your thorough testing and detailed feedback - I really
> appreciate your help.
You're welcome, and thanks for your patience!
> > It seems like this is a GNR machine - maybe SNC could be enabled.
>
> Actually, my cpu is AMD EPYC 96-Core Processor. SNC is disabled, and
> there's only one NUMA node per socket.
That's interesting.
> > For lkp, smt parameter is disabled. I tried with smt=1 locally, the
> > difference between "with fix" & "w/o fix" is not significate. Maybe smt
> > parameter could be set as 0.
>
> Just to confirm: do you mean that on your machine, when smt=1, the performance
> difference between "with fix" and "without fix" is not significant - regardless
> of whether it's a gain or regression? Thanks.
Yes, that's what I found on my machine. Given that you're using an AMD machine,
the performance differences may come down to hardware differences :).
> > On another machine (2 sockets with SNC3 enabled - 6 NUMA nodes), there's
> > the similar regression happening when tasks fill up a socket and then
> > there're more get_partial_node().
>
> From a theoretical standpoint, it seems like having more nodes should reduce
> lock contention, not increase it...
>
> By the way, I wanted to confirm one thing: in your earlier perf data, I noticed
> that the sampling ratio of native_queued_spin_lock_slowpath and get_partial_node
> slightly increased with the patch. Does this suggest that the lock contention
> you're observing mainly comes from kmem_cache_node->list_lock rather than
> node_barn->lock?
Yes, I think so.
> If possible, could you help confirm this using "perf report -g" to see where the
> contention is coming from?
No problem,
-   42.82%    42.82%  mmap2_processes  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   - 42.17% __mmap
      - 42.17% entry_SYSCALL_64_after_hwframe
         - do_syscall_64
            - 42.16% ksys_mmap_pgoff
               - 42.16% vm_mmap_pgoff
                  - 42.15% do_mmap
                     - 42.14% __mmap_region
                        - 42.09% __mmap_new_vma
                           - 41.59% mas_preallocate
                              - 41.59% kmem_cache_alloc_noprof
                                 - 41.58% __pcs_replace_empty_main
                                    - 40.38% __kmem_cache_alloc_bulk
                                       - 40.38% ___slab_alloc
                                          - 28.62% get_any_partial
                                             - 28.61% get_partial_node
                                                + 28.25% _raw_spin_lock_irqsave
                                          - 11.76% get_partial_node
                                             + 11.66% _raw_spin_lock_irqsave
                                    - 1.00% barn_replace_empty_sheaf
                                       + 0.95% _raw_spin_lock_irqsave
   + 0.65% __munmap
> > Back to my previous test, I'm guessing that with this fix, under extreme
> > conditions of massive mmap usage, each CPU now stores an empty spare sheaf
> > locally. Previously, each CPU's spare sheaf was NULL. So memory pressure
> > increases with more spare sheaves locally.
>
> I'm not quite sure about this point - my intuition is that this shouldn't
> consume a significant amount of memory.
>
> > And in that extreme scenario,
> > cross-socket remote NUMA access incurs significant overhead — which is why
> > regression occurs here.
>
> This part I haven't fully figured out yet - still looking into it.
This part is hard to say; it could also be due to certain differences in
the hardware itself that your machine simply doesn't hit.
> > However, testing from 1 task to max tasks (nr_tasks = nr_logical_cpus)
> > shows overall significant improvements in most scenarios. Regressions
> > only occur at the specific topology boundaries described above.
>
> It does look like there's some underlying factor at play, triggering a
> performance tipping point. Though I haven't yet figured out the exact pattern.
For details: on my machines, I test with nr_task ranging from 0, 1, 4, 8 all the
way up to max_cpus, and plot the score curves with and without the fix to
observe how the fix behaves under different conditions.
> > I believe the cases with performance gains are more common. So I think
> > the regression is a corner case. If it does indeed impact certain
> > workloads in the future, we may need to reconsider optimization at that
> > time. It can now be used as a reference.
>
> Agreed — this seems to be a corner case, and your test results have been really
> helpful as a reference. Thanks again for the great support and insightful
> discussion.
It's been a pleasure communicating with you. :)
Thanks,
Zhao
On Thu, Jan 15, 2026 at 06:12:44PM +0800, Zhao Liu wrote:
> Hi Babka & Hao,
>
> > Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> > adjusted like this:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f21b2f0c6f5a..ad71f01571f0 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > - barn_put_empty_sheaf(barn, pcs->main);
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + } else {
> > + barn_put_empty_sheaf(barn, pcs->main);
> > + }
> > pcs->main = full;
> > return pcs;
> > }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
> results:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases (nr_tasks: 1-192), there're the improvements. I think
> this is expected since pre-cached spare sheaf reduces spinlock race:
> reduce barn_put_empty_sheaf() & barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
Hello, Zhao,
Thanks for running the performance test!
>
>
>
> But I find there are two more questions that might need consideration?
>
> # Question 1: Regression for 256 tasks
>
> For the above test - the case with nr_tasks: 256, there's a "slight"
> regression. I did more testing:
>
> (This is a single-round test; the 256-tasks data has jitter.)
>
> nr_tasks Delta
> 244 0.308%
> 248 - 0.805%
> 252 12.070%
> 256 -11.441%
> 258 2.070%
> 260 1.252%
> 264 2.369%
> 268 -11.479%
> 272 2.130%
> 292 8.714%
> 296 10.905%
> 298 17.196%
> 300 11.783%
> 302 6.620%
> 304 3.112%
> 308 - 5.924%
>
> It can be seen that most cases show improvement, though a few may
> experience slight regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I set the CPU affinity on the core, 256 cases is roughly
> equivalent to the moment when Node 0 and Node 1 are filled.
>
> The following is the perf data comparing 2 tests w/o fix & with this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the insteresting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. HMM, however, I still
> can't figure out why this is happening. Do you have any thoughts on it?
I'd like to dig a bit deeper to confirm whether the "256 tasks" result is truly
a regression. Could you please share the original full report, or let me know
which test case under will-it-scale/ you used?
--
Thanks,
Hao
>
> # Question 2: sheaf capacity
>
> Back the original commit which triggerred lkp regression. I did more
> testing to check if this fix could totally fill the regression gap.
>
> The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") has the regression.
>
> I compared v6.19-rc4(f0b9d8eb98df) w/o fix & with fix aginst the base
> line:
>
> nr_tasks w/o fix with fix
> 1 - 3.643% - 0.181%
> 8 -12.523% - 9.816%
> 64 -50.378% -20.482%
> 128 -36.736% - 5.518%
> 192 -22.963% - 1.777%
> 256 -32.926% - 41.026%
>
> It appears that under extreme conditions, regression remains significate.
> I remembered your suggestion about larger capacity and did the following
> testing:
>
> 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4
> (with this fix) (cap: 32->64) (cap: 32->128) (cap: 32->256)
> 1 -8.789% -8.805% -8.185% -9.912% -8.673%
> 8 -12.256% -9.219% -10.460% -10.070% -8.819%
> 64 -38.915% -8.172% -4.700% 4.571% 8.793%
> 128 -8.032% 11.377% 23.232% 26.940% 30.573%
> 192 -1.220% 9.758% 20.573% 22.645% 25.768%
> 256 -6.570% 9.967% 21.663% 30.103% 33.876%
>
> Comparing with base line (3accabda4), larger capacity could
> significatly improve the Sheaf's scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
>
> Thanks for your patience.
>
> Regards,
> Zhao
>
>
> I'd like to dig a bit deeper to confirm whether the "256 tasks" result is truly
> a regression.

The "256" seems to align closely with the NUMA topology on my machine, so
I'm unsure how it will perform on other machines.

> Could you please share the original full report, or let me know
> which test case under will-it-scale/ you used?

I mainly followed Suneeth's steps [*]:

1) git clone https://github.com/antonblanchard/will-it-scale.git
2) git clone https://github.com/intel/lkp-tests.git
3) cd will-it-scale && git apply lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
4) make
5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256

[*]: https://lore.kernel.org/all/262c742f-dc0c-4adc-b23c-047cd3298a5e@amd.com/

Since the raw perf.data files are too big and would be blocked, if you need
to see any specific part of the content, I can paste the info for you.

Regards,
Zhao
On Fri, Jan 16, 2026 at 05:16:03PM +0800, Zhao Liu wrote:
> > I'd like to dig a bit deeper to confirm whether the "256 tasks" result is truly
> > a regression.
>
> The "256" seems to align closely with the NUMA topology on my machine, so
> I'm unsure how it will perform on other machines.

Got it. Thanks, in any case, I'll try to reproduce it first.

> > Could you please share the original full report, or let me know
> > which test case under will-it-scale/ you used?
>
> I mainly followed Suneeth's steps [*]:
>
> 1) git clone https://github.com/antonblanchard/will-it-scale.git
> 2) git clone https://github.com/intel/lkp-tests.git
> 3) cd will-it-scale && git apply lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
> 4) make
> 5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256
>
> [*]: https://lore.kernel.org/all/262c742f-dc0c-4adc-b23c-047cd3298a5e@amd.com/

Thanks!

> Since the raw perf.data files are too big and would be blocked, if you need
> to see any specific part of the content, I can paste the info for you.
>
> Regards,
> Zhao
>
On 1/15/26 11:12, Zhao Liu wrote:
> Hi Babka & Hao,
>
>> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
>> adjusted like this:
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index f21b2f0c6f5a..ad71f01571f0 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>> */
>>
>> if (pcs->main->size == 0) {
>> - barn_put_empty_sheaf(barn, pcs->main);
>> + if (!pcs->spare) {
>> + pcs->spare = pcs->main;
>> + } else {
>> + barn_put_empty_sheaf(barn, pcs->main);
>> + }
>> pcs->main = full;
>> return pcs;
>> }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
> results:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases (nr_tasks: 1-192), there're the improvements. I think
> this is expected since pre-cached spare sheaf reduces spinlock race:
> reduce barn_put_empty_sheaf() & barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
Thanks!
> But I find there are two more questions that might need consideration?
>
> # Question 1: Regression for 256 tasks
>
> For the above test - the case with nr_tasks: 256, there's a "slight"
> regression. I did more testing:
>
> (This is a single-round test; the 256-tasks data has jitter.)
>
> nr_tasks Delta
> 244 0.308%
> 248 - 0.805%
> 252 12.070%
> 256 -11.441%
> 258 2.070%
> 260 1.252%
> 264 2.369%
> 268 -11.479%
> 272 2.130%
> 292 8.714%
> 296 10.905%
> 298 17.196%
> 300 11.783%
> 302 6.620%
> 304 3.112%
> 308 - 5.924%
>
> It can be seen that most cases show improvement, though a few may
> experience slight regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I set the CPU affinity on the core, 256 cases is roughly
> equivalent to the moment when Node 0 and Node 1 are filled.
>
> The following is the perf data comparing 2 tests w/o fix & with this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the insteresting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. HMM, however, I still
> can't figure out why this is happening. Do you have any thoughts on it?
I'm not sure if it's statistically significant or just noise, +0.09% could
be noise?
> # Question 2: sheaf capacity
>
> Back the original commit which triggerred lkp regression. I did more
> testing to check if this fix could totally fill the regression gap.
>
> The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") has the regression.
>
> I compared v6.19-rc4(f0b9d8eb98df) w/o fix & with fix aginst the base
> line:
>
> nr_tasks w/o fix with fix
> 1 - 3.643% - 0.181%
> 8 -12.523% - 9.816%
> 64 -50.378% -20.482%
> 128 -36.736% - 5.518%
> 192 -22.963% - 1.777%
> 256 -32.926% - 41.026%
>
> It appears that under extreme conditions, regression remains significate.
> I remembered your suggestion about larger capacity and did the following
> testing:
>
> 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4
> (with this fix) (cap: 32->64) (cap: 32->128) (cap: 32->256)
> 1 -8.789% -8.805% -8.185% -9.912% -8.673%
> 8 -12.256% -9.219% -10.460% -10.070% -8.819%
> 64 -38.915% -8.172% -4.700% 4.571% 8.793%
> 128 -8.032% 11.377% 23.232% 26.940% 30.573%
> 192 -1.220% 9.758% 20.573% 22.645% 25.768%
> 256 -6.570% 9.967% 21.663% 30.103% 33.876%
>
> Comparing with base line (3accabda4), larger capacity could
> significatly improve the Sheaf's scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
In the followup series, there will be automatically determined capacity to
roughly match the current capacity of cpu partial slabs:
https://lore.kernel.org/all/20260112-sheaves-for-all-v2-4-98225cfb50cf@suse.cz/
We can use that as starting point for further tuning. But I suspect making
it adjust dynamically would be complicated.
> Thanks for your patience.
>
> Regards,
> Zhao
>
> > The following is the perf data comparing 2 tests w/o fix & with this fix:
> >
> > # Baseline Delta Abs Shared Object Symbol
> > # ........ ......... ....................... ....................................
> > #
> > 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> > 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> > 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> > 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> > 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> > 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> > 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> > 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> > 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> > 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> > 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> > 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> > 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> > 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> > 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> > 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> > 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> > 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> > 0.26% -0.07% [kernel.vmlinux] [k] down_write
> > 0.53% -0.06% libc.so.6 [.] __mmap
> > 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> > 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> > 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> > 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> > 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> > 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> > 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> > 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> > 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> > 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> > 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> > 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> > 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> > 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> > 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> > 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> > 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> > 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> > 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> > 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> > 0.49% -0.04% libc.so.6 [.] __munmap
> > 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> > 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> > 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> > 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> > 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> > 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> > 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> > 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> > 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> > 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> > 0.27% -0.03% [kernel.vmlinux] [k] up_write
> > 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> > 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> > 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> > 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> > 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
> >
> > I think the insteresting item is "get_partial_node". It seems this fix
> > makes "get_partial_node" slightly more frequent. HMM, however, I still
> > can't figure out why this is happening. Do you have any thoughts on it?
>
> I'm not sure if it's statistically significant or just noise, +0.09% could
> be noise?
A small number doesn't always mean it's noise. When perf samples get_partial_node
on the spinlock call chain, its subroutines (the spinlock) are hotter, so
the proportion attributed to subroutine execution is higher. If the function
get_partial_node itself (excluding subroutines) executes very quickly,
its own proportion is lower.
I also expanded the perf data with call chains:
* w/o fix:
We can calculate the proportion of spin lock time introduced by
get_partial_node: 31.05% / 49.91% = 62.21%
49.91% mmap2_processes [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
|
--49.91%--native_queued_spin_lock_slowpath
|
--49.91%--_raw_spin_lock_irqsave
|
|--31.05%--get_partial_node
| |
| |--23.66%--get_any_partial
| | ___slab_alloc
| |
| --7.40%--___slab_alloc
| __kmem_cache_alloc_bulk
|
|--10.84%--barn_get_empty_sheaf
| |
| |--6.18%--__kfree_rcu_sheaf
| | kvfree_call_rcu
| |
| --4.66%--__pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
|--5.10%--barn_put_empty_sheaf
| |
| --5.09%--__pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
|--2.01%--barn_replace_empty_sheaf
| __pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
--0.78%--__put_partials
|
--0.78%--__kmem_cache_free_bulk.part.0
rcu_free_sheaf
* with fix:
Similarly, the proportion of spin-lock time introduced by get_partial_node
is: 39.91% / 42.82% = 93.20%
42.82% mmap2_processes [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
|
---native_queued_spin_lock_slowpath
|
--42.82%--_raw_spin_lock_irqsave
|
|--39.91%--get_partial_node
| |
| |--28.25%--get_any_partial
| | ___slab_alloc
| |
| --11.66%--___slab_alloc
| __kmem_cache_alloc_bulk
|
|--1.09%--barn_get_empty_sheaf
| |
| --0.90%--__kfree_rcu_sheaf
| kvfree_call_rcu
|
|--0.96%--barn_replace_empty_sheaf
| __pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
--0.77%--__put_partials
__kmem_cache_free_bulk.part.0
rcu_free_sheaf
So, the shift from 62.21% to 93.20% could reflect that get_partial_node
contributes more of the locking overhead at this point.
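
To tie this back to the code, the empty-main path of __pcs_replace_empty_main()
with the fix applied looks roughly like the sketch below. I reconstructed it
from the hunks quoted in this thread rather than from the tree, and the
comments reflect my reading of the _raw_spin_lock_irqsave call chains above,
so treat it as a sketch, not the exact upstream source:

	if (pcs->main->size == 0) {
		if (!pcs->spare) {
			/* keep the empty main sheaf locally as the spare;
			 * no call into the barn (and no spin lock) on this
			 * path */
			pcs->spare = pcs->main;
		} else {
			/* only reached when a spare already exists; per the
			 * call chains above, barn_put_empty_sheaf() ends up
			 * in _raw_spin_lock_irqsave */
			barn_put_empty_sheaf(barn, pcs->main);
		}
		pcs->main = full;
		return pcs;
	}

Without the fix, barn_put_empty_sheaf() ran unconditionally here, which matches
the 5.10% barn_put_empty_sheaf branch in the w/o-fix profile; that branch is
gone with the fix.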
> > So, I'd like to know if you think dynamically or adaptively adjusting
> > capacity is a worthwhile idea.
>
> In the followup series, there will be automatically determined capacity to
> roughly match the current capacity of cpu partial slabs:
>
> https://lore.kernel.org/all/20260112-sheaves-for-all-v2-4-98225cfb50cf@suse.cz/
>
> We can use that as starting point for further tuning. But I suspect making
> it adjust dynamically would be complicated.
Thanks, will continue to evaluate this series.
Regards,
Zhao
On Fri, Jan 16, 2026 at 05:07:30PM +0800, Zhao Liu wrote:
> [perf diff output and call-chain breakdown snipped; the full data is
>  quoted earlier in this thread]
>
> So, the shift from 62.21% to 93.20% could reflect that get_partial_node
> contributes more of the locking overhead at this point.

Thanks for the detailed notes. I'll try to reproduce it to see what
exactly happened.

--
Thanks,
Hao
On Mon, Dec 15, 2025 at 03:30:48PM +0100, Vlastimil Babka wrote:
> On 12/10/25 01:26, Hao Li wrote:
> > From: Hao Li <haolee.swjtu@gmail.com>
> >
> > When __pcs_replace_empty_main() fails to obtain a full sheaf directly
> > from the barn, it may either:
> >
> > - Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
> > - Allocate a brand new full sheaf via alloc_full_sheaf().
> >
> > After reacquiring the per-CPU lock, if pcs->main is still empty and
> > pcs->spare is NULL, the current code donates the empty main sheaf to
> > the barn via barn_put_empty_sheaf() and installs the full sheaf as
> > pcs->main, leaving pcs->spare unpopulated.
> >
> > Instead, keep the existing empty main sheaf locally as the spare:
> >
> > pcs->spare = pcs->main;
> > pcs->main = full;
> >
> > This populates pcs->spare earlier, which can reduce future barn traffic.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
> > ---
> >
> > The Gmail account(haoli.tcs) I used to send v1 of the patch has been
> > restricted from sending emails for unknown reasons, so I'm sending v2
> > from this address instead. Thanks.
> >
> > mm/slub.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a0b905c2a557..a3e73ebb0cc8 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + pcs->main = full;
> > + return pcs;
> > + }
> > barn_put_empty_sheaf(barn, pcs->main);
> > pcs->main = full;
> > return pcs;
>
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
nit: no braces for single statement?
> pcs->main = full;
> return pcs;
> }
Otherwise LGTM, so:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
On 12/22/25 11:20, Harry Yoo wrote:
> On Mon, Dec 15, 2025 at 03:30:48PM +0100, Vlastimil Babka wrote:
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>> */
>>
>> if (pcs->main->size == 0) {
>> - barn_put_empty_sheaf(barn, pcs->main);
>> + if (!pcs->spare) {
>> + pcs->spare = pcs->main;
>> + } else {
>> + barn_put_empty_sheaf(barn, pcs->main);
>> + }
>
> nit: no braces for single statement?
Right, fixed up.
>> pcs->main = full;
>> return pcs;
>> }
>
> Otherwise LGTM, so:
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Thanks!
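
FWIW, after dropping the braces the hunk presumably ends up reading as
follows (a sketch for reference; the version in slab/for-next is
authoritative):

	if (pcs->main->size == 0) {
		if (!pcs->spare)
			pcs->spare = pcs->main;
		else
			barn_put_empty_sheaf(barn, pcs->main);
		pcs->main = full;
		return pcs;
	}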
On Mon, Dec 15, 2025 at 10:30 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/25 01:26, Hao Li wrote:
> > From: Hao Li <haolee.swjtu@gmail.com>
> >
> > When __pcs_replace_empty_main() fails to obtain a full sheaf directly
> > from the barn, it may either:
> >
> > - Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
> > - Allocate a brand new full sheaf via alloc_full_sheaf().
> >
> > After reacquiring the per-CPU lock, if pcs->main is still empty and
> > pcs->spare is NULL, the current code donates the empty main sheaf to
> > the barn via barn_put_empty_sheaf() and installs the full sheaf as
> > pcs->main, leaving pcs->spare unpopulated.
> >
> > Instead, keep the existing empty main sheaf locally as the spare:
> >
> > pcs->spare = pcs->main;
> > pcs->main = full;
> >
> > This populates pcs->spare earlier, which can reduce future barn traffic.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
> > ---
> >
> > The Gmail account(haoli.tcs) I used to send v1 of the patch has been
> > restricted from sending emails for unknown reasons, so I'm sending v2
> > from this address instead. Thanks.
> >
> > mm/slub.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a0b905c2a557..a3e73ebb0cc8 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + pcs->main = full;
> > + return pcs;
> > + }
> > barn_put_empty_sheaf(barn, pcs->main);
> > pcs->main = full;
> > return pcs;
>
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
Thanks!
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
Nice simplification. Thanks!
> pcs->main = full;
> return pcs;
> }
>
>
>