From: Hao Li <haolee.swjtu@gmail.com>
When __pcs_replace_empty_main() fails to obtain a full sheaf directly
from the barn, it may either:
- Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
- Allocate a brand new full sheaf via alloc_full_sheaf().
After reacquiring the per-CPU lock, if pcs->main is still empty and
pcs->spare is NULL, the current code donates the empty main sheaf to
the barn via barn_put_empty_sheaf() and installs the full sheaf as
pcs->main, leaving pcs->spare unpopulated.
Instead, keep the existing empty main sheaf locally as the spare:
pcs->spare = pcs->main;
pcs->main = full;
This populates pcs->spare earlier, which can reduce future barn traffic.
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
---
The Gmail account (haoli.tcs) I used to send v1 of the patch has been
restricted from sending emails for unknown reasons, so I'm sending v2
from this address instead. Thanks.
mm/slub.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/slub.c b/mm/slub.c
index a0b905c2a557..a3e73ebb0cc8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
*/
if (pcs->main->size == 0) {
+ if (!pcs->spare) {
+ pcs->spare = pcs->main;
+ pcs->main = full;
+ return pcs;
+ }
barn_put_empty_sheaf(barn, pcs->main);
pcs->main = full;
return pcs;
--
2.50.1
On 12/10/25 01:26, Hao Li wrote:
> From: Hao Li <haolee.swjtu@gmail.com>
>
> When __pcs_replace_empty_main() fails to obtain a full sheaf directly
> from the barn, it may either:
>
> - Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
> - Allocate a brand new full sheaf via alloc_full_sheaf().
>
> After reacquiring the per-CPU lock, if pcs->main is still empty and
> pcs->spare is NULL, the current code donates the empty main sheaf to
> the barn via barn_put_empty_sheaf() and installs the full sheaf as
> pcs->main, leaving pcs->spare unpopulated.
>
> Instead, keep the existing empty main sheaf locally as the spare:
>
> pcs->spare = pcs->main;
> pcs->main = full;
>
> This populates pcs->spare earlier, which can reduce future barn traffic.
>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
> ---
>
> The Gmail account(haoli.tcs) I used to send v1 of the patch has been
> restricted from sending emails for unknown reasons, so I'm sending v2
> from this address instead. Thanks.
>
> mm/slub.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index a0b905c2a557..a3e73ebb0cc8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + pcs->main = full;
> + return pcs;
> + }
> barn_put_empty_sheaf(barn, pcs->main);
> pcs->main = full;
> return pcs;
Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
adjusted like this:
diff --git a/mm/slub.c b/mm/slub.c
index f21b2f0c6f5a..ad71f01571f0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
*/
if (pcs->main->size == 0) {
- barn_put_empty_sheaf(barn, pcs->main);
+ if (!pcs->spare) {
+ pcs->spare = pcs->main;
+ } else {
+ barn_put_empty_sheaf(barn, pcs->main);
+ }
pcs->main = full;
return pcs;
}
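For reference, with this adjustment the resulting branch reads roughly as
follows (simply the hunk above with its diff markers resolved; indentation
approximate, not copied verbatim from the tree):

	if (pcs->main->size == 0) {
		if (!pcs->spare) {
			/* keep the empty main sheaf locally as the spare */
			pcs->spare = pcs->main;
		} else {
			/* spare already populated; donate the empty sheaf to the barn */
			barn_put_empty_sheaf(barn, pcs->main);
		}
		pcs->main = full;
		return pcs;
	}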
Hi Babka & Hao,
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
> pcs->main = full;
> return pcs;
> }
I noticed the previous lkp regression report and tested this fix:
* will-it-scale.per_process_ops
Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
results:
nr_tasks Delta
1 + 3.593%
8 + 3.094%
64 +60.247%
128 +49.344%
192 +27.500%
256 -12.077%
For the cases (nr_tasks: 1-192), there are improvements. I think
this is expected since the pre-cached spare sheaf reduces spinlock
contention: fewer barn_put_empty_sheaf() & barn_get_empty_sheaf() calls.
So (maybe too late),
Tested-by: Zhao Liu <zhao1.liu@intel.com>
But I found two more questions that might need consideration:
# Question 1: Regression for 256 tasks
For the above test - the case with nr_tasks: 256, there's a "slight"
regression. I did more testing:
(This is a single-round test; the 256-tasks data has jitter.)
nr_tasks Delta
244 0.308%
248 - 0.805%
252 12.070%
256 -11.441%
258 2.070%
260 1.252%
264 2.369%
268 -11.479%
272 2.130%
292 8.714%
296 10.905%
298 17.196%
300 11.783%
302 6.620%
304 3.112%
308 - 5.924%
It can be seen that most cases show improvement, though a few may
experience slight regression.
Based on the configuration of my machine:
GNR - 2 sockets with the following NUMA topology:
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0-42,172-214
NUMA node1 CPU(s): 43-85,215-257
NUMA node2 CPU(s): 86-128,258-300
NUMA node3 CPU(s): 129-171,301-343
Since I set the CPU affinity on the cores, the 256-task case roughly
corresponds to the moment when Node 0 and Node 1 are filled.
The following is the perf data comparing 2 tests w/o fix & with this fix:
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ....................... ....................................
#
61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
0.93% -0.32% [kernel.vmlinux] [k] __slab_free
0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
0.26% -0.07% [kernel.vmlinux] [k] down_write
0.53% -0.06% libc.so.6 [.] __mmap
0.66% -0.06% [kernel.vmlinux] [k] mas_walk
0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
0.45% -0.06% [kernel.vmlinux] [k] mas_find
0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
0.41% -0.05% [kernel.vmlinux] [k] memcpy
0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
0.14% +0.04% [kernel.vmlinux] [k] __put_partials
0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
0.49% -0.04% libc.so.6 [.] __munmap
0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
0.27% -0.03% [kernel.vmlinux] [k] up_write
0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
I think the interesting item is "get_partial_node". It seems this fix
makes get_partial_node slightly more frequent. However, I still
can't figure out why this is happening. Do you have any thoughts on it?
# Question 2: sheaf capacity
Back to the original commit which triggered the lkp regression. I did more
testing to check whether this fix could fully close the regression gap.
The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
use percpu sheaves for maple_node_cache") has the regression.
I compared v6.19-rc4 (f0b9d8eb98df) w/o fix & with fix against the
baseline:
nr_tasks w/o fix with fix
1 - 3.643% - 0.181%
8 -12.523% - 9.816%
64 -50.378% -20.482%
128 -36.736% - 5.518%
192 -22.963% - 1.777%
256 -32.926% - 41.026%
It appears that under extreme conditions, the regression remains significant.
I remembered your suggestion about larger capacity and did the following
testing:
 nr_tasks   59faa4da7cd4   59faa4da7cd4      59faa4da7cd4    59faa4da7cd4     59faa4da7cd4
                           (with this fix)   (cap: 32->64)   (cap: 32->128)   (cap: 32->256)
        1       -8.789%        -8.805%           -8.185%         -9.912%          -8.673%
        8      -12.256%        -9.219%          -10.460%        -10.070%          -8.819%
       64      -38.915%        -8.172%           -4.700%          4.571%           8.793%
      128       -8.032%        11.377%           23.232%         26.940%          30.573%
      192       -1.220%         9.758%           20.573%         22.645%          25.768%
      256       -6.570%         9.967%           21.663%         30.103%          33.876%
Compared with the baseline (3accabda4), larger capacity could
significantly improve the sheaves' scalability.
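For concreteness, the "cap: 32->N" experiments above presumably amount to
requesting a larger sheaf capacity when the cache is created. A hedged
sketch only - the sheaf_capacity field comes from the percpu sheaves
series, while the cache name, size, flags and value below are illustrative
assumptions rather than the actual call site:

	struct kmem_cache_args args = {
		/* the tests above compared 32 (current value) with 64, 128 and 256 */
		.sheaf_capacity = 128,
	};
	struct kmem_cache *cache;

	cache = kmem_cache_create("vm_area_struct", sizeof(struct vm_area_struct),
				  &args, SLAB_PANIC|SLAB_ACCOUNT);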
So, I'd like to know if you think dynamically or adaptively adjusting
capacity is a worthwhile idea.
Thanks for your patience.
Regards,
Zhao
On Thu, Jan 15, 2026 at 06:12:44PM +0800, Zhao Liu wrote:
> Hi Babka & Hao,
>
> > Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> > adjusted like this:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f21b2f0c6f5a..ad71f01571f0 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > - barn_put_empty_sheaf(barn, pcs->main);
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + } else {
> > + barn_put_empty_sheaf(barn, pcs->main);
> > + }
> > pcs->main = full;
> > return pcs;
> > }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
> results:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases (nr_tasks: 1-192), there're the improvements. I think
> this is expected since pre-cached spare sheaf reduces spinlock race:
> reduce barn_put_empty_sheaf() & barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
>
>
>
> But I find there are two more questions that might need consideration?
>
> # Question 1: Regression for 256 tasks
>
> For the above test - the case with nr_tasks: 256, there's a "slight"
> regression. I did more testing:
>
> (This is a single-round test; the 256-tasks data has jitter.)
>
> nr_tasks Delta
> 244 0.308%
> 248 - 0.805%
> 252 12.070%
> 256 -11.441%
> 258 2.070%
> 260 1.252%
> 264 2.369%
> 268 -11.479%
> 272 2.130%
> 292 8.714%
> 296 10.905%
> 298 17.196%
> 300 11.783%
> 302 6.620%
> 304 3.112%
> 308 - 5.924%
>
> It can be seen that most cases show improvement, though a few may
> experience slight regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I set the CPU affinity on the core, 256 cases is roughly
> equivalent to the moment when Node 0 and Node 1 are filled.
>
> The following is the perf data comparing 2 tests w/o fix & with this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the insteresting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. HMM, however, I still
> can't figure out why this is happening. Do you have any thoughts on it?
Hello, Zhao,
I tested the performance degradation issue we discussed concerning nr_tasks=256.
However, my results differ from yours, so I'd like to share my setup and
findings for clarity and comparison:
1. Machine Configuration
The topology of my machine is as follows:
CPU(s): 384
On-line CPU(s) list: 0-383
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Since my machine only has 192 cores when counting physical cores, I had to
enable SMT to support the higher number of tasks in the LKP test cases. My
configuration was as follows:
will-it-scale:
mode: process
test: mmap2
no_affinity: 0
smt: 1
The sequence of test cases I used was: nr_tasks= 1, 8, 64, 128, 192, 256, 384.
I noticed that your test command did not enable SMT, but I believe this
difference should not significantly affect the results. I wanted to highlight
this to ensure we account for any potential impact these differences might have
on our results.
2. Kernel Configuration
I conducted tests using the commit f0b9d8eb98dfee8d00419aa07543bdc2c1a44fb1
first, then applied the patch and tested again.
Each test was run 10 times, and I took the average results.
3. Test Results (Without Patch vs. With Patch)
will-it-scale.1.processes -1.27%
will-it-scale.8.processes +0.19%
will-it-scale.64.processes +25.81%
will-it-scale.128.processes +112.88%
will-it-scale.192.processes +157.42%
will-it-scale.256.processes +70.63%
will-it-scale.384.processes +132.12%
will-it-scale.per_process_ops +27.21%
will-it-scale.scalability +135.10%
will-it-scale.time.involuntary_context_switches +127.54%
will-it-scale.time.voluntary_context_switches +0.01%
will-it-scale.workload +94.47%
From the above results, it appears that the patch improved performance across
the board.
4. Further Analysis
I conducted additional tests by running "./mmap2_processes -t 384 -s 25 -m" both
without and with the patch, and sampled the results using perf.
Here's the "perf report --no-children -g" output without the patch:
```
- 65.72% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
- 55.33% testcase
- 55.33% __mmap
- 55.32% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 55.30% ksys_mmap_pgoff
- 55.30% vm_mmap_pgoff
- 55.28% do_mmap
- 55.24% __mmap_region
- 44.35% mas_preallocate
- 44.34% mas_alloc_nodes
- 44.34% kmem_cache_alloc_noprof
- 44.33% __pcs_replace_empty_main
+ 21.23% barn_put_empty_sheaf
+ 15.95% barn_get_empty_sheaf
+ 5.50% barn_replace_empty_sheaf
+ 1.33% _raw_spin_unlock_irqrestore
+ 10.24% mas_store_prealloc
+ 0.56% perf_event_mmap
- 10.38% __munmap
- 10.38% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 10.36% __x64_sys_munmap
- 10.36% __vm_munmap
- 10.36% do_vmi_munmap
- 10.35% do_vmi_align_munmap
- 10.14% mas_store_gfp
- 10.13% mas_wr_node_store
- 10.09% kvfree_call_rcu
- 10.09% __kfree_rcu_sheaf
- 10.08% barn_get_empty_sheaf
+ 9.17% _raw_spin_lock_irqsave
+ 0.90% _raw_spin_unlock_irqrestore
```
Here's the "perf report --no-children -g" output with the patch:
```
+ 30.36% mmap2_processes [kernel.kallsyms] [k] perf_iterate_ctx
- 28.80% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
- 24.72% testcase
- 24.71% __mmap
- 24.68% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 24.61% ksys_mmap_pgoff
- 24.57% vm_mmap_pgoff
- 24.51% do_mmap
- 24.30% __mmap_region
- 18.33% mas_preallocate
- 18.30% mas_alloc_nodes
- 18.30% kmem_cache_alloc_noprof
- 18.28% __pcs_replace_empty_main
+ 9.06% barn_replace_empty_sheaf
+ 6.12% barn_get_empty_sheaf
+ 3.09% refill_sheaf
+ 2.94% mas_store_prealloc
+ 2.64% perf_event_mmap
- 4.07% __munmap
- 4.04% entry_SYSCALL_64_after_hwframe
- do_syscall_64
- 3.98% __x64_sys_munmap
- 3.98% __vm_munmap
- 3.95% do_vmi_munmap
- 3.91% do_vmi_align_munmap
- 2.98% mas_store_gfp
- 2.90% mas_wr_node_store
- 2.75% kvfree_call_rcu
- 2.73% __kfree_rcu_sheaf
- 2.71% barn_get_empty_sheaf
+ 1.68% _raw_spin_lock_irqsave
+ 1.03% _raw_spin_unlock_irqrestore
- 0.76% vms_complete_munmap_vmas
0.67% vms_clear_ptes.part.41
```
Using perf diff, I compared the results before and after applying the patch:
```
# Event 'cycles:P'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... .................... ..................................................
#
65.72% -36.92% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
14.65% +15.70% [kernel.kallsyms] [k] perf_iterate_ctx
2.10% +2.45% [kernel.kallsyms] [k] unmap_page_range
1.09% +1.26% [kernel.kallsyms] [k] mas_wr_node_store
1.01% +1.14% [kernel.kallsyms] [k] free_pgd_range
0.84% +0.92% [kernel.kallsyms] [k] __mmap_region
0.50% +0.76% [kernel.kallsyms] [k] memcpy
0.62% +0.63% [kernel.kallsyms] [k] __cond_resched
0.49% +0.51% [kernel.kallsyms] [k] mas_walk
0.39% +0.42% [kernel.kallsyms] [k] mas_empty_area_rev
0.32% +0.40% [kernel.kallsyms] [k] mas_next_slot
0.34% +0.39% [kernel.kallsyms] [k] refill_sheaf
0.26% +0.36% [kernel.kallsyms] [k] mas_prev_slot
0.24% +0.29% [kernel.kallsyms] [k] do_syscall_64
0.25% +0.28% [kernel.kallsyms] [k] mas_find
0.20% +0.28% [kernel.kallsyms] [k] kmem_cache_alloc_noprof
0.24% +0.27% [kernel.kallsyms] [k] strlen
0.26% +0.27% [kernel.kallsyms] [k] perf_event_mmap
0.25% +0.26% [kernel.kallsyms] [k] do_mmap
0.22% +0.25% [kernel.kallsyms] [k] mas_store_gfp
0.25% +0.24% [kernel.kallsyms] [k] mas_leaf_max_gap
```
I also sampled the execution counts of several key functions using bpftrace.
Without Patch:
```
@cnt[barn_put_empty_sheaf]: 38833037
@cnt[barn_replace_empty_sheaf]: 41883891
@cnt[__pcs_replace_empty_main]: 41884885
@cnt[barn_get_empty_sheaf]: 75422518
@cnt[mmap]: 489634255
```
With Patch:
```
@cnt[barn_put_empty_sheaf]: 2382910
@cnt[barn_replace_empty_sheaf]: 90681637
@cnt[__pcs_replace_empty_main]: 90683656
@cnt[barn_get_empty_sheaf]: 82710919
@cnt[mmap]: 1113853385
```
From the above results, I found that the execution count of the
barn_put_empty_sheaf function dropped by an order of magnitude after applying
the patch. This is likely due to the patch's effect: when pcs->spare is NULL,
the empty sheaf is cached in pcs->spare instead of calling barn_put_empty_sheaf.
This reduces contention on the barn spinlock significantly.
At the same time, I noticed that the execution counts for
barn_replace_empty_sheaf and __pcs_replace_empty_main increased, but their
proportion in the perf sampling decreased. This suggests that the average
execution time for these functions has decreased.
Moreover, the total number of mmap executions after applying the patch
(1113853385) is more than double that of the unpatched kernel (489634255). This
further supports our analysis: since the test case duration is fixed at 25
seconds, the patched kernel runs faster, resulting in more iterations of the
test case and more mmap executions, which in turn increases the frequency of
these functions being called.
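Normalizing by the mmap count makes the same point per operation (just
dividing the bpftrace counts above, rounded):

  barn_put_empty_sheaf per mmap:  38833037 / 489634255  ~= 0.079  (w/o patch)
                                   2382910 / 1113853385 ~= 0.002  (with patch)
  barn_get_empty_sheaf per mmap:  75422518 / 489634255  ~= 0.154  (w/o patch)
                                  82710919 / 1113853385 ~= 0.074  (with patch)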
Based on my tests, everything appears reasonable and explainable. However, I
couldn't reproduce the performance drop for nr_tasks=256, and it's unclear why
our results differ. I'd appreciate it if you could share any additional insights
or thoughts on what might be causing this discrepancy. If needed, we could also
consult Vlastimil for further suggestions to better understand the issue or
explore other potential factors.
Thanks!
--
Thanks,
Hao
>
> # Question 2: sheaf capacity
>
> Back the original commit which triggerred lkp regression. I did more
> testing to check if this fix could totally fill the regression gap.
>
> The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") has the regression.
>
> I compared v6.19-rc4(f0b9d8eb98df) w/o fix & with fix aginst the base
> line:
>
> nr_tasks w/o fix with fix
> 1 - 3.643% - 0.181%
> 8 -12.523% - 9.816%
> 64 -50.378% -20.482%
> 128 -36.736% - 5.518%
> 192 -22.963% - 1.777%
> 256 -32.926% - 41.026%
>
> It appears that under extreme conditions, regression remains significate.
> I remembered your suggestion about larger capacity and did the following
> testing:
>
> 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4
> (with this fix) (cap: 32->64) (cap: 32->128) (cap: 32->256)
> 1 -8.789% -8.805% -8.185% -9.912% -8.673%
> 8 -12.256% -9.219% -10.460% -10.070% -8.819%
> 64 -38.915% -8.172% -4.700% 4.571% 8.793%
> 128 -8.032% 11.377% 23.232% 26.940% 30.573%
> 192 -1.220% 9.758% 20.573% 22.645% 25.768%
> 256 -6.570% 9.967% 21.663% 30.103% 33.876%
>
> Comparing with base line (3accabda4), larger capacity could
> significatly improve the Sheaf's scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
>
> Thanks for your patience.
>
> Regards,
> Zhao
>
>
> 1. Machine Configuration
>
> The topology of my machine is as follows:
>
> CPU(s): 384
> On-line CPU(s) list: 0-383
> Thread(s) per core: 2
> Core(s) per socket: 96
> Socket(s): 2
> NUMA node(s): 2

It seems like this is a GNR machine - maybe SNC could be enabled.

> Since my machine only has 192 cores when counting physical cores, I had to
> enable SMT to support the higher number of tasks in the LKP test cases. My
> configuration was as follows:
>
> will-it-scale:
> mode: process
> test: mmap2
> no_affinity: 0
> smt: 1

For lkp, the smt parameter is disabled. I tried with smt=1 locally, and the
difference between "with fix" & "w/o fix" is not significant. Maybe the smt
parameter could be set to 0.

On another machine (2 sockets with SNC3 enabled - 6 NUMA nodes), there's
a similar regression happening when tasks fill up a socket and then
there are more get_partial_node() calls.

> Here's the "perf report --no-children -g" output with the patch:
>
> ```
> + 30.36% mmap2_processes [kernel.kallsyms] [k] perf_iterate_ctx
> - 28.80% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> - 24.72% testcase
> - 24.71% __mmap
> - 24.68% entry_SYSCALL_64_after_hwframe
> - do_syscall_64
> - 24.61% ksys_mmap_pgoff
> - 24.57% vm_mmap_pgoff
> - 24.51% do_mmap
> - 24.30% __mmap_region
> - 18.33% mas_preallocate
> - 18.30% mas_alloc_nodes
> - 18.30% kmem_cache_alloc_noprof
> - 18.28% __pcs_replace_empty_main
> + 9.06% barn_replace_empty_sheaf
> + 6.12% barn_get_empty_sheaf
> + 3.09% refill_sheaf

This is the difference from my previous perf report: here the proportion
of refill_sheaf is low - it indicates the sheaves are sufficient most of
the time.

Back to my previous test, I'm guessing that with this fix, under extreme
conditions of massive mmap usage, each CPU now stores an empty spare sheaf
locally. Previously, each CPU's spare sheaf was NULL. So memory pressure
increases with more spare sheaves locally. And in that extreme scenario,
cross-socket remote NUMA access incurs significant overhead — which is why
the regression occurs here.

However, testing from 1 task to max tasks (nr_tasks = nr_logical_cpus)
shows overall significant improvements in most scenarios. Regressions
only occur at the specific topology boundaries described above.

I believe the cases with performance gains are more common. So I think
the regression is a corner case. If it does indeed impact certain
workloads in the future, we may need to reconsider optimization at that
time. It can now be used as a reference.

Thanks,
Zhao
On Tue, Jan 20, 2026 at 04:21:16PM +0800, Zhao Liu wrote:

Hi, Zhao,

Thanks again for your thorough testing and detailed feedback - I really
appreciate your help.

> > 1. Machine Configuration
> >
> > The topology of my machine is as follows:
> >
> > CPU(s): 384
> > On-line CPU(s) list: 0-383
> > Thread(s) per core: 2
> > Core(s) per socket: 96
> > Socket(s): 2
> > NUMA node(s): 2
>
> It seems like this is a GNR machine - maybe SNC could be enabled.

Actually, my cpu is AMD EPYC 96-Core Processor. SNC is disabled, and
there's only one NUMA node per socket.

> > Since my machine only has 192 cores when counting physical cores, I had to
> > enable SMT to support the higher number of tasks in the LKP test cases. My
> > configuration was as follows:
> >
> > will-it-scale:
> > mode: process
> > test: mmap2
> > no_affinity: 0
> > smt: 1
>
> For lkp, the smt parameter is disabled. I tried with smt=1 locally, and the
> difference between "with fix" & "w/o fix" is not significant. Maybe the smt
> parameter could be set to 0.

Just to confirm: do you mean that on your machine, when smt=1, the performance
difference between "with fix" and "without fix" is not significant - regardless
of whether it's a gain or regression? Thanks.

> On another machine (2 sockets with SNC3 enabled - 6 NUMA nodes), there's
> a similar regression happening when tasks fill up a socket and then
> there are more get_partial_node() calls.

From a theoretical standpoint, it seems like having more nodes should reduce
lock contention, not increase it...

By the way, I wanted to confirm one thing: in your earlier perf data, I noticed
that the sampling ratio of native_queued_spin_lock_slowpath and get_partial_node
slightly increased with the patch. Does this suggest that the lock contention
you're observing mainly comes from kmem_cache_node->list_lock rather than
node_barn->lock? If possible, could you help confirm this using "perf report -g"
to see where the contention is coming from?

> > Here's the "perf report --no-children -g" output with the patch:
> >
> > ```
> > + 30.36% mmap2_processes [kernel.kallsyms] [k] perf_iterate_ctx
> > - 28.80% mmap2_processes [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> > - 24.72% testcase
> > - 24.71% __mmap
> > - 24.68% entry_SYSCALL_64_after_hwframe
> > - do_syscall_64
> > - 24.61% ksys_mmap_pgoff
> > - 24.57% vm_mmap_pgoff
> > - 24.51% do_mmap
> > - 24.30% __mmap_region
> > - 18.33% mas_preallocate
> > - 18.30% mas_alloc_nodes
> > - 18.30% kmem_cache_alloc_noprof
> > - 18.28% __pcs_replace_empty_main
> > + 9.06% barn_replace_empty_sheaf
> > + 6.12% barn_get_empty_sheaf
> > + 3.09% refill_sheaf
>
> This is the difference from my previous perf report: here the proportion
> of refill_sheaf is low - it indicates the sheaves are sufficient most of
> the time.
>
> Back to my previous test, I'm guessing that with this fix, under extreme
> conditions of massive mmap usage, each CPU now stores an empty spare sheaf
> locally. Previously, each CPU's spare sheaf was NULL. So memory pressure
> increases with more spare sheaves locally.

I'm not quite sure about this point - my intuition is that this shouldn't
consume a significant amount of memory.

> And in that extreme scenario,
> cross-socket remote NUMA access incurs significant overhead — which is why
> the regression occurs here.

This part I haven't fully figured out yet - still looking into it.

> However, testing from 1 task to max tasks (nr_tasks = nr_logical_cpus)
> shows overall significant improvements in most scenarios. Regressions
> only occur at the specific topology boundaries described above.

It does look like there's some underlying factor at play, triggering a
performance tipping point. Though I haven't yet figured out the exact pattern.

> I believe the cases with performance gains are more common. So I think
> the regression is a corner case. If it does indeed impact certain
> workloads in the future, we may need to reconsider optimization at that
> time. It can now be used as a reference.

Agreed — this seems to be a corner case, and your test results have been really
helpful as a reference. Thanks again for the great support and insightful
discussion.

--
Thanks,
Hao
> Thanks again for your thorough testing and detailed feedback - I really
> appreciate your help.
You're welcome, and thanks for your patience!
> > It seems like this is a GNR machine - maybe SNC could be enabled.
>
> Actually, my cpu is AMD EPYC 96-Core Processor. SNC is disabled, and
> there's only one NUMA node per socket.
That's interesting.
> > For lkp, smt parameter is disabled. I tried with smt=1 locally, the
> > difference between "with fix" & "w/o fix" is not significate. Maybe smt
> > parameter could be set as 0.
>
> Just to confirm: do you mean that on your machine, when smt=1, the performance
> difference between "with fix" and "without fix" is not significant - regardless
> of whether it's a gain or regression? Thanks.
Yes, that's what I found on my machine. Given that you're using an AMD machine,
the performance differences may come down to hardware differences :).
> > On another machine (2 sockets with SNC3 enabled - 6 NUMA nodes), there's
> > the similar regression happening when tasks fill up a socket and then
> > there're more get_partial_node().
>
> From a theoretical standpoint, it seems like having more nodes should reduce
> lock contention, not increase it...
>
> By the way, I wanted to confirm one thing: in your earlier perf data, I noticed
> that the sampling ratio of native_queued_spin_lock_slowpath and get_partial_node
> slightly increased with the patch. Does this suggest that the lock contention
> you're observing mainly comes from kmem_cache_node->list_lock rather than
> node_barn->lock?
Yes, I think so.
> If possible, could you help confirm this using "perf report -g" to see where the
> contention is coming from?
No problem,
-   42.82%    42.82%  mmap2_processes  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
   - 42.17% __mmap
      - 42.17% entry_SYSCALL_64_after_hwframe
         - do_syscall_64
            - 42.16% ksys_mmap_pgoff
               - 42.16% vm_mmap_pgoff
                  - 42.15% do_mmap
                     - 42.14% __mmap_region
                        - 42.09% __mmap_new_vma
                           - 41.59% mas_preallocate
                              - 41.59% kmem_cache_alloc_noprof
                                 - 41.58% __pcs_replace_empty_main
                                    - 40.38% __kmem_cache_alloc_bulk
                                       - 40.38% ___slab_alloc
                                          - 28.62% get_any_partial
                                             - 28.61% get_partial_node
                                                + 28.25% _raw_spin_lock_irqsave
                                          - 11.76% get_partial_node
                                             + 11.66% _raw_spin_lock_irqsave
                                    - 1.00% barn_replace_empty_sheaf
                                       + 0.95% _raw_spin_lock_irqsave
   + 0.65% __munmap
> > Back to my previous test, I'm guessing that with this fix, under extreme
> > conditions of massive mmap usage, each CPU now stores an empty spare sheaf
> > locally. Previously, each CPU's spare sheaf was NULL. So memory pressure
> > increases with more spare sheaves locally.
>
> I'm not quite sure about this point - my intuition is that this shouldn't
> consume a significant amount of memory.
>
> > And in that extreme scenario,
> > cross-socket remote NUMA access incurs significant overhead — which is why
> > regression occurs here.
>
> This part I haven't fully figured out yet - still looking into it.
This part is hard to say; it could also be due to certain differences in
the hardware itself that your machine simply doesn't hit.
> > However, testing from 1 task to max tasks (nr_tasks = nr_logical_cpus)
> > shows overall significant improvements in most scenarios. Regressions
> > only occur at the specific topology boundaries described above.
>
> It does look like there's some underlying factor at play, triggering a
> performance tipping point. Though I haven't yet figured out the exact pattern.
For details: on my machines, I test with nr_task ranging from 0, 1, 4, 8 all the
way up to max_cpus, and plot the score curves with and without the fix to
observe how the fix behaves under different conditions.
> > I believe the cases with performance gains are more common. So I think
> > the regression is a corner case. If it does indeed impact certain
> > workloads in the future, we may need to reconsider optimization at that
> > time. It can now be used as a reference.
>
> Agreed — this seems to be a corner case, and your test results have been really
> helpful as a reference. Thanks again for the great support and insightful
> discussion.
It's been a pleasure communicating with you. :)
Thanks,
Zhao
On Thu, Jan 15, 2026 at 06:12:44PM +0800, Zhao Liu wrote:
> Hi Babka & Hao,
>
> > Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> > adjusted like this:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f21b2f0c6f5a..ad71f01571f0 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > - barn_put_empty_sheaf(barn, pcs->main);
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + } else {
> > + barn_put_empty_sheaf(barn, pcs->main);
> > + }
> > pcs->main = full;
> > return pcs;
> > }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
> results:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases (nr_tasks: 1-192), there're the improvements. I think
> this is expected since pre-cached spare sheaf reduces spinlock race:
> reduce barn_put_empty_sheaf() & barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
Hello, Zhao,
Thanks for running the performance test!
>
>
>
> But I find there are two more questions that might need consideration?
>
> # Question 1: Regression for 256 tasks
>
> For the above test - the case with nr_tasks: 256, there's a "slight"
> regression. I did more testing:
>
> (This is a single-round test; the 256-tasks data has jitter.)
>
> nr_tasks Delta
> 244 0.308%
> 248 - 0.805%
> 252 12.070%
> 256 -11.441%
> 258 2.070%
> 260 1.252%
> 264 2.369%
> 268 -11.479%
> 272 2.130%
> 292 8.714%
> 296 10.905%
> 298 17.196%
> 300 11.783%
> 302 6.620%
> 304 3.112%
> 308 - 5.924%
>
> It can be seen that most cases show improvement, though a few may
> experience slight regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I set the CPU affinity on the core, 256 cases is roughly
> equivalent to the moment when Node 0 and Node 1 are filled.
>
> The following is the perf data comparing 2 tests w/o fix & with this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the insteresting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. HMM, however, I still
> can't figure out why this is happening. Do you have any thoughts on it?
I'd like to dig a bit deeper to confirm whether the "256 tasks" result is truly
a regression. Could you please share the original full report, or let me know
which test case under will-it-scale/ you used?
--
Thanks,
Hao
>
> # Question 2: sheaf capacity
>
> Back the original commit which triggerred lkp regression. I did more
> testing to check if this fix could totally fill the regression gap.
>
> The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") has the regression.
>
> I compared v6.19-rc4(f0b9d8eb98df) w/o fix & with fix aginst the base
> line:
>
> nr_tasks w/o fix with fix
> 1 - 3.643% - 0.181%
> 8 -12.523% - 9.816%
> 64 -50.378% -20.482%
> 128 -36.736% - 5.518%
> 192 -22.963% - 1.777%
> 256 -32.926% - 41.026%
>
> It appears that under extreme conditions, regression remains significate.
> I remembered your suggestion about larger capacity and did the following
> testing:
>
> 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4
> (with this fix) (cap: 32->64) (cap: 32->128) (cap: 32->256)
> 1 -8.789% -8.805% -8.185% -9.912% -8.673%
> 8 -12.256% -9.219% -10.460% -10.070% -8.819%
> 64 -38.915% -8.172% -4.700% 4.571% 8.793%
> 128 -8.032% 11.377% 23.232% 26.940% 30.573%
> 192 -1.220% 9.758% 20.573% 22.645% 25.768%
> 256 -6.570% 9.967% 21.663% 30.103% 33.876%
>
> Comparing with base line (3accabda4), larger capacity could
> significatly improve the Sheaf's scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
>
> Thanks for your patience.
>
> Regards,
> Zhao
>
>
> I'd like to dig a bit deeper to confirm whether the "256 tasks" result is truly
> a regression.

The "256" seems to align closely with the NUMA topology on my machine, so
I'm unsure how it will perform on other machines.

> Could you please share the original full report, or let me know
> which test case under will-it-scale/ you used?

I mainly followed Suneeth's steps [*]:

1) git clone https://github.com/antonblanchard/will-it-scale.git
2) git clone https://github.com/intel/lkp-tests.git
3) cd will-it-scale && git apply lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
4) make
5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256

[*]: https://lore.kernel.org/all/262c742f-dc0c-4adc-b23c-047cd3298a5e@amd.com/

Since the raw perf.data files are too big and would be blocked, if you need
to see any specific part of the content, I can paste the info for you.

Regards,
Zhao
On Fri, Jan 16, 2026 at 05:16:03PM +0800, Zhao Liu wrote:
> > I'd like to dig a bit deeper to confirm whether the "256 tasks" result is truly
> > a regression.
>
> The "256" seems to align closely with the NUMA topology on my machine, so
> I'm unsure how it will perform on other machines.

Got it. Thanks, in any case, I'll try to reproduce it first.

> > Could you please share the original full report, or let me know
> > which test case under will-it-scale/ you used?
>
> I mainly followed Suneeth's steps [*]:
>
> 1) git clone https://github.com/antonblanchard/will-it-scale.git
> 2) git clone https://github.com/intel/lkp-tests.git
> 3) cd will-it-scale && git apply lkp-tests/programs/will-it-scale/pkg/will-it-scale.patch
> 4) make
> 5) python3 runtest.py mmap2 25 process 0 0 1 8 64 128 192 256
>
> [*]: https://lore.kernel.org/all/262c742f-dc0c-4adc-b23c-047cd3298a5e@amd.com/

Thanks!

> Since the raw perf.data files are too big and would be blocked, if you need
> to see any specific part of the content, I can paste the info for you.
>
> Regards,
> Zhao
>
On 1/15/26 11:12, Zhao Liu wrote:
> Hi Babka & Hao,
>
>> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
>> adjusted like this:
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index f21b2f0c6f5a..ad71f01571f0 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>> */
>>
>> if (pcs->main->size == 0) {
>> - barn_put_empty_sheaf(barn, pcs->main);
>> + if (!pcs->spare) {
>> + pcs->spare = pcs->main;
>> + } else {
>> + barn_put_empty_sheaf(barn, pcs->main);
>> + }
>> pcs->main = full;
>> return pcs;
>> }
>
> I noticed the previous lkp regression report and tested this fix:
>
> * will-it-scale.per_process_ops
>
> Compared with v6.19-rc4(f0b9d8eb98df), with this fix, I have these
> results:
>
> nr_tasks Delta
> 1 + 3.593%
> 8 + 3.094%
> 64 +60.247%
> 128 +49.344%
> 192 +27.500%
> 256 -12.077%
>
> For the cases (nr_tasks: 1-192), there're the improvements. I think
> this is expected since pre-cached spare sheaf reduces spinlock race:
> reduce barn_put_empty_sheaf() & barn_get_empty_sheaf().
>
> So (maybe too late),
>
> Tested-by: Zhao Liu <zhao1.liu@intel.com>
Thanks!
> But I find there are two more questions that might need consideration?
>
> # Question 1: Regression for 256 tasks
>
> For the above test - the case with nr_tasks: 256, there's a "slight"
> regression. I did more testing:
>
> (This is a single-round test; the 256-tasks data has jitter.)
>
> nr_tasks Delta
> 244 0.308%
> 248 - 0.805%
> 252 12.070%
> 256 -11.441%
> 258 2.070%
> 260 1.252%
> 264 2.369%
> 268 -11.479%
> 272 2.130%
> 292 8.714%
> 296 10.905%
> 298 17.196%
> 300 11.783%
> 302 6.620%
> 304 3.112%
> 308 - 5.924%
>
> It can be seen that most cases show improvement, though a few may
> experience slight regression.
>
> Based on the configuration of my machine:
>
> GNR - 2 sockets with the following NUMA topology:
>
> NUMA:
> NUMA node(s): 4
> NUMA node0 CPU(s): 0-42,172-214
> NUMA node1 CPU(s): 43-85,215-257
> NUMA node2 CPU(s): 86-128,258-300
> NUMA node3 CPU(s): 129-171,301-343
>
> Since I set the CPU affinity on the core, 256 cases is roughly
> equivalent to the moment when Node 0 and Node 1 are filled.
>
> The following is the perf data comparing 2 tests w/o fix & with this fix:
>
> # Baseline Delta Abs Shared Object Symbol
> # ........ ......... ....................... ....................................
> #
> 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> 0.26% -0.07% [kernel.vmlinux] [k] down_write
> 0.53% -0.06% libc.so.6 [.] __mmap
> 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> 0.49% -0.04% libc.so.6 [.] __munmap
> 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> 0.27% -0.03% [kernel.vmlinux] [k] up_write
> 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
>
> I think the insteresting item is "get_partial_node". It seems this fix
> makes "get_partial_node" slightly more frequent. HMM, however, I still
> can't figure out why this is happening. Do you have any thoughts on it?
I'm not sure if it's statistically significant or just noise, +0.09% could
be noise?
> # Question 2: sheaf capacity
>
> Back the original commit which triggerred lkp regression. I did more
> testing to check if this fix could totally fill the regression gap.
>
> The base line is commit 3accabda4 ("mm, vma: use percpu sheaves for
> vm_area_struct cache") and its next commit 59faa4da7cd4 ("maple_tree:
> use percpu sheaves for maple_node_cache") has the regression.
>
> I compared v6.19-rc4(f0b9d8eb98df) w/o fix & with fix aginst the base
> line:
>
> nr_tasks w/o fix with fix
> 1 - 3.643% - 0.181%
> 8 -12.523% - 9.816%
> 64 -50.378% -20.482%
> 128 -36.736% - 5.518%
> 192 -22.963% - 1.777%
> 256 -32.926% - 41.026%
>
> It appears that under extreme conditions, regression remains significate.
> I remembered your suggestion about larger capacity and did the following
> testing:
>
> 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4 59faa4da7cd4
> (with this fix) (cap: 32->64) (cap: 32->128) (cap: 32->256)
> 1 -8.789% -8.805% -8.185% -9.912% -8.673%
> 8 -12.256% -9.219% -10.460% -10.070% -8.819%
> 64 -38.915% -8.172% -4.700% 4.571% 8.793%
> 128 -8.032% 11.377% 23.232% 26.940% 30.573%
> 192 -1.220% 9.758% 20.573% 22.645% 25.768%
> 256 -6.570% 9.967% 21.663% 30.103% 33.876%
>
> Comparing with base line (3accabda4), larger capacity could
> significatly improve the Sheaf's scalability.
>
> So, I'd like to know if you think dynamically or adaptively adjusting
> capacity is a worthwhile idea.
In the followup series, there will be automatically determined capacity to
roughly match the current capacity of cpu partial slabs:
https://lore.kernel.org/all/20260112-sheaves-for-all-v2-4-98225cfb50cf@suse.cz/
We can use that as starting point for further tuning. But I suspect making
it adjust dynamically would be complicated.
> Thanks for your patience.
>
> Regards,
> Zhao
>
> > The following is the perf data comparing 2 tests w/o fix & with this fix:
> >
> > # Baseline Delta Abs Shared Object Symbol
> > # ........ ......... ....................... ....................................
> > #
> > 61.76% +4.78% [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
> > 0.93% -0.32% [kernel.vmlinux] [k] __slab_free
> > 0.39% -0.31% [kernel.vmlinux] [k] barn_get_empty_sheaf
> > 1.35% -0.30% [kernel.vmlinux] [k] mas_leaf_max_gap
> > 3.22% -0.30% [kernel.vmlinux] [k] __kmem_cache_alloc_bulk
> > 1.73% -0.20% [kernel.vmlinux] [k] __cond_resched
> > 0.52% -0.19% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
> > 0.92% +0.18% [kernel.vmlinux] [k] _raw_spin_lock
> > 1.91% -0.15% [kernel.vmlinux] [k] zap_pmd_range.isra.0
> > 1.37% -0.13% [kernel.vmlinux] [k] mas_wr_node_store
> > 1.29% -0.12% [kernel.vmlinux] [k] free_pud_range
> > 0.92% -0.11% [kernel.vmlinux] [k] __mmap_region
> > 0.12% -0.11% [kernel.vmlinux] [k] barn_put_empty_sheaf
> > 0.20% -0.09% [kernel.vmlinux] [k] barn_replace_empty_sheaf
> > 0.31% +0.09% [kernel.vmlinux] [k] get_partial_node
> > 0.29% -0.07% [kernel.vmlinux] [k] __rcu_free_sheaf_prepare
> > 0.12% -0.07% [kernel.vmlinux] [k] intel_idle_xstate
> > 0.21% -0.07% [kernel.vmlinux] [k] __kfree_rcu_sheaf
> > 0.26% -0.07% [kernel.vmlinux] [k] down_write
> > 0.53% -0.06% libc.so.6 [.] __mmap
> > 0.66% -0.06% [kernel.vmlinux] [k] mas_walk
> > 0.48% -0.06% [kernel.vmlinux] [k] mas_prev_slot
> > 0.45% -0.06% [kernel.vmlinux] [k] mas_find
> > 0.38% -0.06% [kernel.vmlinux] [k] mas_wr_store_type
> > 0.23% -0.06% [kernel.vmlinux] [k] do_vmi_align_munmap
> > 0.21% -0.05% [kernel.vmlinux] [k] perf_event_mmap_event
> > 0.32% -0.05% [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
> > 0.19% -0.05% [kernel.vmlinux] [k] downgrade_write
> > 0.59% -0.05% [kernel.vmlinux] [k] mas_next_slot
> > 0.31% -0.05% [kernel.vmlinux] [k] __mmap_new_vma
> > 0.44% -0.05% [kernel.vmlinux] [k] kmem_cache_alloc_noprof
> > 0.28% -0.05% [kernel.vmlinux] [k] __vma_enter_locked
> > 0.41% -0.05% [kernel.vmlinux] [k] memcpy
> > 0.48% -0.04% [kernel.vmlinux] [k] mas_store_gfp
> > 0.14% +0.04% [kernel.vmlinux] [k] __put_partials
> > 0.19% -0.04% [kernel.vmlinux] [k] mas_empty_area_rev
> > 0.30% -0.04% [kernel.vmlinux] [k] do_syscall_64
> > 0.25% -0.04% [kernel.vmlinux] [k] mas_preallocate
> > 0.15% -0.04% [kernel.vmlinux] [k] rcu_free_sheaf
> > 0.22% -0.04% [kernel.vmlinux] [k] entry_SYSCALL_64
> > 0.49% -0.04% libc.so.6 [.] __munmap
> > 0.91% -0.04% [kernel.vmlinux] [k] rcu_all_qs
> > 0.21% -0.04% [kernel.vmlinux] [k] __vm_munmap
> > 0.24% -0.04% [kernel.vmlinux] [k] mas_store_prealloc
> > 0.19% -0.04% [kernel.vmlinux] [k] __kmalloc_cache_noprof
> > 0.34% -0.04% [kernel.vmlinux] [k] build_detached_freelist
> > 0.19% -0.03% [kernel.vmlinux] [k] vms_complete_munmap_vmas
> > 0.36% -0.03% [kernel.vmlinux] [k] mas_rev_awalk
> > 0.05% -0.03% [kernel.vmlinux] [k] shuffle_freelist
> > 0.19% -0.03% [kernel.vmlinux] [k] down_write_killable
> > 0.19% -0.03% [kernel.vmlinux] [k] kmem_cache_free
> > 0.27% -0.03% [kernel.vmlinux] [k] up_write
> > 0.13% -0.03% [kernel.vmlinux] [k] vm_area_alloc
> > 0.18% -0.03% [kernel.vmlinux] [k] arch_get_unmapped_area_topdown
> > 0.08% -0.03% [kernel.vmlinux] [k] userfaultfd_unmap_complete
> > 0.10% -0.03% [kernel.vmlinux] [k] tlb_gather_mmu
> > 0.30% -0.02% [kernel.vmlinux] [k] ___slab_alloc
> >
> > I think the insteresting item is "get_partial_node". It seems this fix
> > makes "get_partial_node" slightly more frequent. HMM, however, I still
> > can't figure out why this is happening. Do you have any thoughts on it?
>
> I'm not sure if it's statistically significant or just noise, +0.09% could
> be noise?
A small number doesn't always mean it's noise. When perf samples get_partial_node
on the spinlock call chain, its subroutines (the spinlock) are hotter, so
the proportion attributed to subroutine execution is higher. If the function
get_partial_node itself (excluding subroutines) executes very quickly,
its own proportion is lower.
I also expanded the perf data with call chains:
* w/o fix:
We can calculate the proportion of spin lock time introduced by
get_partial_node: 31.05% / 49.91% = 62.21%
49.91% mmap2_processes [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
|
--49.91%--native_queued_spin_lock_slowpath
|
--49.91%--_raw_spin_lock_irqsave
|
|--31.05%--get_partial_node
| |
| |--23.66%--get_any_partial
| | ___slab_alloc
| |
| --7.40%--___slab_alloc
| __kmem_cache_alloc_bulk
|
|--10.84%--barn_get_empty_sheaf
| |
| |--6.18%--__kfree_rcu_sheaf
| | kvfree_call_rcu
| |
| --4.66%--__pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
|--5.10%--barn_put_empty_sheaf
| |
| --5.09%--__pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
|--2.01%--barn_replace_empty_sheaf
| __pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
--0.78%--__put_partials
|
--0.78%--__kmem_cache_free_bulk.part.0
rcu_free_sheaf
* with fix:
Similarly, the proportion of spin-lock time introduced by get_partial_node
is: 39.91% / 42.82% = 93.20%
42.82% mmap2_processes [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
|
---native_queued_spin_lock_slowpath
|
--42.82%--_raw_spin_lock_irqsave
|
|--39.91%--get_partial_node
| |
| |--28.25%--get_any_partial
| | ___slab_alloc
| |
| --11.66%--___slab_alloc
| __kmem_cache_alloc_bulk
|
|--1.09%--barn_get_empty_sheaf
| |
| --0.90%--__kfree_rcu_sheaf
| kvfree_call_rcu
|
|--0.96%--barn_replace_empty_sheaf
| __pcs_replace_empty_main
| kmem_cache_alloc_noprof
|
--0.77%--__put_partials
__kmem_cache_free_bulk.part.0
rcu_free_sheaf
So, the shift from 62.21% to 93.20% could reflect that get_partial_node
contributes more of the locking overhead at this point.
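
To tie this back to the code, the empty-main path of __pcs_replace_empty_main()
with the fix applied looks roughly like the sketch below. I reconstructed it
from the hunks quoted in this thread rather than from the tree, and the
comments reflect my reading of the _raw_spin_lock_irqsave call chains above,
so treat it as a sketch, not the exact upstream source:

	if (pcs->main->size == 0) {
		if (!pcs->spare) {
			/* keep the empty main sheaf locally as the spare;
			 * no call into the barn (and no spin lock) on this
			 * path */
			pcs->spare = pcs->main;
		} else {
			/* only reached when a spare already exists; per the
			 * call chains above, barn_put_empty_sheaf() ends up
			 * in _raw_spin_lock_irqsave */
			barn_put_empty_sheaf(barn, pcs->main);
		}
		pcs->main = full;
		return pcs;
	}

Without the fix, barn_put_empty_sheaf() ran unconditionally here, which matches
the 5.10% barn_put_empty_sheaf branch in the w/o-fix profile; that branch is
gone with the fix.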
> > So, I'd like to know if you think dynamically or adaptively adjusting
> > capacity is a worthwhile idea.
>
> In the followup series, there will be automatically determined capacity to
> roughly match the current capacity of cpu partial slabs:
>
> https://lore.kernel.org/all/20260112-sheaves-for-all-v2-4-98225cfb50cf@suse.cz/
>
> We can use that as starting point for further tuning. But I suspect making
> it adjust dynamically would be complicated.
Thanks, will continue to evaluate this series.
Regards,
Zhao
On Fri, Jan 16, 2026 at 05:07:30PM +0800, Zhao Liu wrote:
> [perf diff output and call-chain breakdown snipped; the full data is
>  quoted earlier in this thread]
>
> So, the shift from 62.21% to 93.20% could reflect that get_partial_node
> contributes more of the locking overhead at this point.

Thanks for the detailed notes. I'll try to reproduce it to see what
exactly happened.

--
Thanks,
Hao
On Mon, Dec 15, 2025 at 03:30:48PM +0100, Vlastimil Babka wrote:
> On 12/10/25 01:26, Hao Li wrote:
> > From: Hao Li <haolee.swjtu@gmail.com>
> >
> > When __pcs_replace_empty_main() fails to obtain a full sheaf directly
> > from the barn, it may either:
> >
> > - Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
> > - Allocate a brand new full sheaf via alloc_full_sheaf().
> >
> > After reacquiring the per-CPU lock, if pcs->main is still empty and
> > pcs->spare is NULL, the current code donates the empty main sheaf to
> > the barn via barn_put_empty_sheaf() and installs the full sheaf as
> > pcs->main, leaving pcs->spare unpopulated.
> >
> > Instead, keep the existing empty main sheaf locally as the spare:
> >
> > pcs->spare = pcs->main;
> > pcs->main = full;
> >
> > This populates pcs->spare earlier, which can reduce future barn traffic.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
> > ---
> >
> > The Gmail account(haoli.tcs) I used to send v1 of the patch has been
> > restricted from sending emails for unknown reasons, so I'm sending v2
> > from this address instead. Thanks.
> >
> > mm/slub.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a0b905c2a557..a3e73ebb0cc8 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + pcs->main = full;
> > + return pcs;
> > + }
> > barn_put_empty_sheaf(barn, pcs->main);
> > pcs->main = full;
> > return pcs;
>
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
nit: no braces for single statement?
> pcs->main = full;
> return pcs;
> }
Otherwise LGTM, so:
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
On 12/22/25 11:20, Harry Yoo wrote:
> On Mon, Dec 15, 2025 at 03:30:48PM +0100, Vlastimil Babka wrote:
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
>> */
>>
>> if (pcs->main->size == 0) {
>> - barn_put_empty_sheaf(barn, pcs->main);
>> + if (!pcs->spare) {
>> + pcs->spare = pcs->main;
>> + } else {
>> + barn_put_empty_sheaf(barn, pcs->main);
>> + }
>
> nit: no braces for single statement?
Right, fixed up.
>> pcs->main = full;
>> return pcs;
>> }
>
> Otherwise LGTM, so:
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Thanks!
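
FWIW, after dropping the braces the hunk presumably ends up reading as
follows (a sketch for reference; the version in slab/for-next is
authoritative):

	if (pcs->main->size == 0) {
		if (!pcs->spare)
			pcs->spare = pcs->main;
		else
			barn_put_empty_sheaf(barn, pcs->main);
		pcs->main = full;
		return pcs;
	}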
On Mon, Dec 15, 2025 at 10:30 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 12/10/25 01:26, Hao Li wrote:
> > From: Hao Li <haolee.swjtu@gmail.com>
> >
> > When __pcs_replace_empty_main() fails to obtain a full sheaf directly
> > from the barn, it may either:
> >
> > - Refill an empty sheaf obtained via barn_get_empty_sheaf(), or
> > - Allocate a brand new full sheaf via alloc_full_sheaf().
> >
> > After reacquiring the per-CPU lock, if pcs->main is still empty and
> > pcs->spare is NULL, the current code donates the empty main sheaf to
> > the barn via barn_put_empty_sheaf() and installs the full sheaf as
> > pcs->main, leaving pcs->spare unpopulated.
> >
> > Instead, keep the existing empty main sheaf locally as the spare:
> >
> > pcs->spare = pcs->main;
> > pcs->main = full;
> >
> > This populates pcs->spare earlier, which can reduce future barn traffic.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> > Signed-off-by: Hao Li <haolee.swjtu@gmail.com>
> > ---
> >
> > The Gmail account(haoli.tcs) I used to send v1 of the patch has been
> > restricted from sending emails for unknown reasons, so I'm sending v2
> > from this address instead. Thanks.
> >
> > mm/slub.c | 5 +++++
> > 1 file changed, 5 insertions(+)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index a0b905c2a557..a3e73ebb0cc8 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -5077,6 +5077,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> > */
> >
> > if (pcs->main->size == 0) {
> > + if (!pcs->spare) {
> > + pcs->spare = pcs->main;
> > + pcs->main = full;
> > + return pcs;
> > + }
> > barn_put_empty_sheaf(barn, pcs->main);
> > pcs->main = full;
> > return pcs;
>
> Thanks, LGTM. We can make it smaller though. Adding to slab/for-next
> adjusted like this:
Thanks!
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f21b2f0c6f5a..ad71f01571f0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5052,7 +5052,11 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
> */
>
> if (pcs->main->size == 0) {
> - barn_put_empty_sheaf(barn, pcs->main);
> + if (!pcs->spare) {
> + pcs->spare = pcs->main;
> + } else {
> + barn_put_empty_sheaf(barn, pcs->main);
> + }
Nice simplification. Thanks!
> pcs->main = full;
> return pcs;
> }
>
>
>