[PATCH v5 00/14] SLUB percpu sheaves
Posted by Vlastimil Babka 2 months, 2 weeks ago
Hi,

This series adds an opt-in percpu array-based caching layer to SLUB.
It has evolved to a state where kmem caches with sheaves are compatible
with all SLUB features (slub_debug, SLUB_TINY, NUMA locality
considerations). My hope is therefore that it can eventually be enabled
for all kmem caches and replace the cpu (partial) slabs.

This v5 is posted for review and testing/benchmarking purposes. After
6.17-rc1 I hope to post a rebased v6 and start including it in
linux-next.

Note the name "sheaf" was invented by Matthew Wilcox so we don't call
the arrays "magazines" as in the original Bonwick paper. The
per-NUMA-node cache of sheaves is thus called a "barn".

This caching may seem similar to the arrays in SLAB, but there are some
important differences:

- does not distinguish NUMA locality, thus there are no per-node
  "shared" arrays (with possible lock contention) and no "alien" arrays
  that would need periodical flushing
  - NUMA restricted allocations and strict_numa mode are still honoured;
    the percpu sheaves are bypassed for those allocations
  - a later patch (for separate evaluation) makes freeing remote objects
    bypass sheaves, so sheaves contain mostly (not strictly) local objects
- improves kfree_rcu() handling by reusing whole sheaves
- there is an API for obtaining a preallocated sheaf that can be used
  for guaranteed and efficient allocations in a restricted context, when
  the upper bound for needed objects is known but rarely reached
- opt-in, not used for every cache (for now)

The motivation comes mainly from the ongoing work related to VMA locking
scalability and the related maple tree operations. This is why the VMA
and maple node caches are sheaf-enabled in the patchset. In v5 I include
Liam's patches for the full maple tree conversion that uses the improved
preallocation API.
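
For illustration, a cache opts in by setting the sheaf_capacity field
that this series adds to struct kmem_cache_args (a minimal sketch; the
cache name and object struct below are made up, and 32 is just an
example capacity, matching what the benchmarks below use):

	/* sketch: non-zero sheaf_capacity opts the cache into sheaves */
	struct kmem_cache_args args = {
		.sheaf_capacity = 32,
	};

	cache = kmem_cache_create("my_cache", sizeof(struct my_obj),
				  &args, 0);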

A sheaf-enabled cache has the following expected advantages:

- Cheaper fast paths. For allocations, thanks to local_trylock() it
  becomes a preempt_disable() with no atomic operations, instead of a
  local double cmpxchg. The same goes for freeing, which without sheaves
  is a local double cmpxchg only for short term allocations (so the same
  slab is still active on the same cpu when freeing the object) and a
  more costly locked double cmpxchg otherwise. See the fast path sketch
  after this list.

- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
  separate percpu sheaf and only submit the whole sheaf to call_rcu()
  when full. After the grace period, the sheaf can be used for
  allocations, which is more efficient than freeing and reallocating
  individual slab objects (even with the batching done by kfree_rcu()
  implementation itself). In case only some cpus are allowed to handle rcu
  callbacks, the sheaf can still be made available to other cpus on the
  same node via the shared barn. The maple_node cache uses kfree_rcu() and
  thus can benefit from this.

- Preallocation support. A prefilled sheaf can be privately borrowed to
  perform a short term operation that is not allowed to block in the
  middle and may need to allocate some objects. If an upper bound (worst
  case) for the number of allocations is known, but on average much fewer
  allocations are actually needed, borrowing and returning a sheaf is
  much more efficient than a bulk allocation for the worst case followed
  by a bulk free of the many unused objects. Maple tree write operations
  should benefit from this. A usage sketch follows after this list.

- Compatibility with slub_debug. When slub_debug is enabled for a cache,
  we simply don't create the percpu sheaves so that the debugging hooks
  (at the node partial list slowpaths) are reached as before. The same
  thing is done for CONFIG_SLUB_TINY. Sheaf preallocation still works by
  reusing the (ineffective) paths for requests exceeding the cache's
  sheaf_capacity. This is in line with the existing approach where
  debugging bypasses the fast paths and SLUB_TINY prefers memory
  savings over performance.
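
To make the fast path point concrete, here is a minimal sketch of the
sheaf allocation fast path. It is simplified from the series, and the
structure and field names are approximations rather than a verbatim
copy:

	/*
	 * Hedged sketch of the sheaf alloc fast path; returning NULL
	 * means falling back to the regular SLUB path.
	 */
	static void *pcs_alloc_sketch(struct kmem_cache *s)
	{
		struct slub_percpu_sheaves *pcs;
		void *object;

		/* no atomics; on success only preemption is disabled */
		if (!local_trylock(&s->cpu_sheaves->lock))
			return NULL;

		pcs = this_cpu_ptr(s->cpu_sheaves);

		if (unlikely(pcs->main->size == 0)) {
			/* main sheaf empty: refill slow path not shown */
			local_unlock(&s->cpu_sheaves->lock);
			return NULL;
		}

		object = pcs->main->objects[--pcs->main->size];
		local_unlock(&s->cpu_sheaves->lock);

		return object;
	}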
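
And for the preallocation point, a usage sketch of the prefill API
added by this series (function names per the series; exact signatures
and error handling are approximate):

	static int do_short_op(struct kmem_cache *cache, unsigned int worst_case)
	{
		struct slab_sheaf *sheaf;
		void *obj;

		/* borrow a sheaf holding at least worst_case objects */
		sheaf = kmem_cache_prefill_sheaf(cache, GFP_KERNEL, worst_case);
		if (!sheaf)
			return -ENOMEM;

		/* restricted section: must not block, but may allocate */
		obj = kmem_cache_alloc_from_sheaf(cache, GFP_NOWAIT, sheaf);

		/* ... use obj ... */

		/* unused objects go back with the sheaf in one step */
		kmem_cache_return_sheaf(cache, GFP_KERNEL, sheaf);
		return 0;
	}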

GIT TREES:

this series: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v5r0
It is based on v6.16-rc1.

this series plus a microbenchmark hacked into slub_kunit:
https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v5-benchmarking

It allows evaluating the overhead of the added sheaves code, and the
benefits for single-threaded allocations/frees of varying batch size. I
plan to look into adding multi-threaded scenarios too.

The last commit there also adds sheaves to every cache to allow
measuring effects on other caches than vma and maple node. Note these
measurements should be compared to slab_nomerge boots without sheaves,
as adding sheaves makes caches unmergeable.

RESULTS:

In order to get some numbers that should be only due to differences in
implementation, and not cache layout side-effects in users of the slab
objects etc., I have started with an in-kernel microbenchmark that does
allocating and freeing from a slab cache with or without sheaves and/or
memcg. It either alternates single object alloc and free, or allocates
10 objects and frees them, then 100, then 1000 - in order to see the
effects of exhausting percpu sheaves or barn, or (without sheaves) the
percpu slabs. The order of objects to free can also be shuffled instead
of FIFO - to stress the non-sheaf freeing slowpath more.
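
The core of each iteration is shaped roughly like this (an illustrative
sketch, not the actual slub_kunit code; shuffle() stands in for whatever
permutation is actually used):

	/* illustrative sketch of one benchmark iteration */
	for (i = 0; i < batch; i++)
		objects[i] = kmem_cache_alloc(s, GFP_KERNEL);

	if (shuffled)
		shuffle(objects, batch);	/* hypothetical helper */

	for (i = 0; i < batch; i++)
		kmem_cache_free(s, objects[i]);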

Measurements done on Ryzen 7 5700, bare metal.

The first question was how just having the sheaves implementation affects
existing no-sheaf caches due to the extra (unused) code. I have experimented
with changing inlining and adding unlikely() to the sheaves case. The
optimum seems to be what's currently in the implementation - fast-path
sheaves usage is inlined, any handling of the main sheaf being empty on
alloc or full on free is a separate function, and the if (s->sheaf_capacity)
check has neither likely() nor unlikely(). When I added unlikely() it
destroyed the performance of sheaves completely.

So the result is that with batch size 10, there's 2.4% overhead, and the
other cases are all impacted less than this. Hopefully acceptable with the
plan that eventually there would be sheaves everywhere and the current
cpu (partial) slabs scheme removed.

As for the benefits of enabling sheaves (capacity=32), see the results
below - it all looks good here. Of course this microbenchmark is not the
complete story, for at least these reasons:

- no kfree_rcu() evaluation
- doesn't show barn spinlock contention effects. In theory this shouldn't
  be worse than without sheaves, because after exhausting the cpu
  (partial) slabs, the list_lock has to be taken. Sheaf capacity vs
  capacity of partial slabs is a matter of tuning.

 ---------------------------------
 BATCH SIZE: 1 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115660272
 bench: no memcg, sheaves
 average (excl. iter 0): 95734972
 sheaves better by 17.2%
 bench: memcg, no sheaves
 average (excl. iter 0): 163682964
 bench: memcg, sheaves
 average (excl. iter 0): 144792803
 sheaves better by 11.5%

 ---------------------------------
 BATCH SIZE: 10 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115496906
 bench: no memcg, sheaves
 average (excl. iter 0): 97781102
 sheaves better by 15.3%
 bench: memcg, no sheaves
 average (excl. iter 0): 162771491
 bench: memcg, sheaves
 average (excl. iter 0): 144746490
 sheaves better by 11.0%

 ---------------------------------
 BATCH SIZE: 100 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 151796052
 bench: no memcg, sheaves
 average (excl. iter 0): 104641753
 sheaves better by 31.0%
 bench: memcg, no sheaves
 average (excl. iter 0): 200733436
 bench: memcg, sheaves
 average (excl. iter 0): 151340989
 sheaves better by 24.6%

 ---------------------------------
 BATCH SIZE: 1000 SHUFFLED: NO
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 187623118
 bench: no memcg, sheaves
 average (excl. iter 0): 130914624
 sheaves better by 30.2%
 bench: memcg, no sheaves
 average (excl. iter 0): 240239575
 bench: memcg, sheaves
 average (excl. iter 0): 181474462
 sheaves better by 24.4%

 ---------------------------------
 BATCH SIZE: 10 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 115110219
 bench: no memcg, sheaves
 average (excl. iter 0): 100597405
 sheaves better by 12.6%
 bench: memcg, no sheaves
 average (excl. iter 0): 163573377
 bench: memcg, sheaves
 average (excl. iter 0): 144535545
 sheaves better by 11.6%

 ---------------------------------
 BATCH SIZE: 100 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 152457970
 bench: no memcg, sheaves
 average (excl. iter 0): 108720274
 sheaves better by 28.6%
 bench: memcg, no sheaves
 average (excl. iter 0): 203478732
 bench: memcg, sheaves
 average (excl. iter 0): 151241821
 sheaves better by 25.6%

 ---------------------------------
 BATCH SIZE: 1000 SHUFFLED: YES
 ---------------------------------

 bench: no memcg, no sheaves
 average (excl. iter 0): 189950559
 bench: no memcg, sheaves
 average (excl. iter 0): 177934450
 sheaves better by 6.3%
 bench: memcg, no sheaves
 average (excl. iter 0): 242988187
 bench: memcg, sheaves
 average (excl. iter 0): 221609979
 sheaves better by 8.7%

Vlastimil

---
Changes in v5:
- Apply review tags (Harry, Suren) except where changed too much (first
  patch).
- Handle CONFIG_SLUB_TINY by not creating percpu sheaves (Harry)
- Apply review feedback (typos, comments).
- Extract handling sheaf slow paths to separate non-inline functions
  __pcs_handle_empty() and __pcs_handle_full().
- Fix empty sheaf leak in rcu_free_sheaf() (Suren)
- Add "allow NUMA restricted allocations to use percpu sheaves".
- Add Liam's maple tree full sheaf conversion patches for easier
  evaluation.
- Rebase to v6.16-rc1.
- Link to v4: https://patch.msgid.link/20250425-slub-percpu-caches-v4-0-8a636982b4a4@suse.cz

Changes in v4:
- slub_debug disables sheaves for the cache in order to work properly
- strict_numa mode works as intended
- added a separate patch to make freeing remote objects skip sheaves
- various code refactoring suggested by Suren and Harry
- removed less useful stat counters and added missing ones for barn
  and prefilled sheaf events
- Link to v3: https://lore.kernel.org/r/20250317-slub-percpu-caches-v3-0-9d9884d8b643@suse.cz

Changes in v3:
- Squash localtry_lock conversion so it's used immediately.
- Incorporate feedback and add tags from Suren and Harry - thanks!
  - Mostly adding comments and some refactoring.
  - Fixes for kfree_rcu_sheaf() vmalloc handling, cpu hotremove
    flushing.
  - Fix wrong condition in kmem_cache_return_sheaf() that may have
    affected performance negatively.
  - Refactoring of free_to_pcs()
- Link to v2: https://lore.kernel.org/r/20250214-slub-percpu-caches-v2-0-88592ee0966a@suse.cz

Changes in v2:
- Removed kfree_rcu() destructors support as VMAs will not need it
  anymore after [3] is merged.
- Changed to localtry_lock_t borrowed from [2] instead of an own
  implementation of the same idea.
- Many fixes and improvements thanks to Liam's adoption for maple tree
  nodes.
- Userspace Testing stubs by Liam.
- Reduced limitations/todos - hooking to kfree_rcu() is complete,
  prefilled sheaves can exceed cache's sheaf_capacity.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz

---
Liam R. Howlett (6):
      tools: Add testing support for changes to rcu and slab for sheaves
      tools: Add sheaves support to testing infrastructure
      testing/radix-tree/maple: Increase readers and reduce delay for faster machines
      maple_tree: Sheaf conversion
      maple_tree: Add single node allocation support to maple state
      maple_tree: Convert forking to use the sheaf interface

Vlastimil Babka (8):
      slab: add opt-in caching layer of percpu sheaves
      slab: add sheaf support for batching kfree_rcu() operations
      slab: sheaf prefilling for guaranteed allocations
      slab: determine barn status racily outside of lock
      maple_tree: use percpu sheaves for maple_node_cache
      mm, vma: use percpu sheaves for vm_area_struct cache
      mm, slub: skip percpu sheaves for remote object freeing
      mm, slab: allow NUMA restricted allocations to use percpu sheaves

 include/linux/maple_tree.h            |    6 +-
 include/linux/slab.h                  |   47 +
 lib/maple_tree.c                      |  393 +++-----
 lib/test_maple_tree.c                 |    8 +
 mm/slab.h                             |    4 +
 mm/slab_common.c                      |   32 +-
 mm/slub.c                             | 1646 +++++++++++++++++++++++++++++++--
 mm/vma_init.c                         |    1 +
 tools/include/linux/slab.h            |   65 +-
 tools/testing/radix-tree/maple.c      |  639 +++----------
 tools/testing/shared/linux.c          |  112 ++-
 tools/testing/shared/linux/rcupdate.h |   22 +
 12 files changed, 2104 insertions(+), 871 deletions(-)
---
base-commit: 82efd569a8909f2b13140c1b3de88535aea0b051
change-id: 20231128-slub-percpu-caches-9441892011d7

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>
Re: [PATCH v5 00/14] SLUB percpu sheaves
Posted by Sudarsan Mahendran 1 month, 3 weeks ago
Hi Vlastimil,

I ported this patch series on top of v6.17.
I had to resolve some merge conflicts because of 
fba46a5d83ca8decb338722fb4899026d8d9ead2

The conflict resolution looks like:

@@ -5524,20 +5335,19 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
 int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 {
        MA_WR_STATE(wr_mas, mas, entry);
-       int ret = 0;
-       int request;

        mas_wr_prealloc_setup(&wr_mas);
        mas->store_type = mas_wr_store_type(&wr_mas);
-       request = mas_prealloc_calc(&wr_mas, entry);
-       if (!request)
+       mas_prealloc_calc(&wr_mas, entry);
+       if (!mas->node_request)
                goto set_flag;

        mas->mas_flags &= ~MA_STATE_PREALLOC;
-       mas_node_count_gfp(mas, request, gfp);
+       mas_alloc_nodes(mas, gfp);
        if (mas_is_err(mas)) {
-               mas_set_alloc_req(mas, 0);
-               ret = xa_err(mas->node);
+               int ret = xa_err(mas->node);
+
+               mas->node_request = 0;
                mas_destroy(mas);
                mas_reset(mas);
                return ret;
@@ -5545,7 +5355,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)

 set_flag:
        mas->mas_flags |= MA_STATE_PREALLOC;
-       return ret;
+       return 0;
 }
 EXPORT_SYMBOL_GPL(mas_preallocate);



When I try to boot this kernel, I see kernel panic
with rcu_free_sheaf() doing recursion into __kmem_cache_free_bulk()

Stack trace:

[    1.583673] Oops: stack guard page: 0000 [#1] SMP NOPTI
[    1.583676] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.17.0-smp-sheaves2 #1 NONE
[    1.583679] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
[    1.583684] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
[    1.583685] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
[    1.583687] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
[    1.583687] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
[    1.583688] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
[    1.583689] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
[    1.583690] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
[    1.583690] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
[    1.583691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.583692] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
[    1.583692] PKRU: 55555554
[    1.583693] Call Trace:
[    1.583694]  <IRQ>
[    1.583696]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583698]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583700]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583702]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583703]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583705]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583707]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583708]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583710]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583711]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583713]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583715]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583716]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583718]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583719]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583721]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583723]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583724]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583726]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583727]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583729]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583731]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583732]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583734]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583735]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583737]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583739]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583740]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583742]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583743]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583745]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583747]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583748]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583750]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583751]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583753]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583755]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583756]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583758]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583759]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583761]  ? update_group_capacity+0xad/0x1f0
[    1.583763]  ? sched_balance_rq+0x4f6/0x1e80
[    1.583765]  __kmem_cache_free_bulk+0x2c7/0x540
[    1.583767]  ? update_irq_load_avg+0x35/0x480
[    1.583768]  ? __pfx_rcu_free_sheaf+0x10/0x10
[    1.583769]  rcu_free_sheaf+0x86/0x110
[    1.583771]  rcu_do_batch+0x245/0x750
[    1.583772]  rcu_core+0x13a/0x260
[    1.583773]  handle_softirqs+0xcb/0x270
[    1.583775]  __irq_exit_rcu+0x48/0xf0
[    1.583776]  sysvec_apic_timer_interrupt+0x74/0x80
[    1.583778]  </IRQ>
[    1.583778]  <TASK>
[    1.583779]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[    1.583780] RIP: 0010:cpuidle_enter_state+0x101/0x290
[    1.583781] Code: 85 f4 ff ff 49 89 c4 8b 73 04 bf ff ff ff ff e8 d5 44 d4 ff 31 ff e8 9e c7 37 ff 80 7c 24 04 00 74 05 e8 12 45 d4 ff fb 85 ed <0f> 88 ba 00 00 00 89 e9 48 6b f9 68 4c 8b 44 24 08 49 8b 54 38 30
[    1.583782] RSP: 0018:ff40dbc4809afe80 EFLAGS: 00000202
[    1.583782] RAX: ff2aff31ba00b000 RBX: ff2afe75614b0800 RCX: 000000005e64b52b
[    1.583783] RDX: 000000005e73f761 RSI: 0000000000000067 RDI: 0000000000000000
[    1.583783] RBP: 0000000000000002 R08: fffffffffffffff6 R09: 0000000000000000
[    1.583784] R10: 0000000000000380 R11: ffffffff908c38d0 R12: 000000005e64b535
[    1.583784] R13: 000000005e5580da R14: ffffffff92890b10 R15: 0000000000000002
[    1.583784]  ? __pfx_read_tsc+0x10/0x10
[    1.583787]  cpuidle_enter+0x2c/0x40
[    1.583788]  do_idle+0x1a7/0x240
[    1.583790]  cpu_startup_entry+0x2a/0x30
[    1.583791]  start_secondary+0x95/0xa0
[    1.583794]  common_startup_64+0x13e/0x140
[    1.583796]  </TASK>
[    1.583796] Modules linked in:
[    1.583798] ---[ end trace 0000000000000000 ]---
[    1.583798] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
[    1.583800] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
[    1.583800] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
[    1.583801] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
[    1.583801] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
[    1.583801] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
[    1.583802] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
[    1.583802] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
[    1.583802] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
[    1.583803] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.583803] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
[    1.583803] PKRU: 55555554
[    1.583804] Kernel panic - not syncing: Fatal exception in interrupt
[    1.584659] Kernel Offset: 0xf600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Re: [PATCH v5 00/14] SLUB percpu sheaves
Posted by Harry Yoo 1 month, 2 weeks ago
On Fri, Aug 15, 2025 at 03:53:00PM -0700, Sudarsan Mahendran wrote:
> Hi Vlastimil,
> 
> I ported this patch series on top of v6.17.
> I had to resolve some merge conflicts because of 
> fba46a5d83ca8decb338722fb4899026d8d9ead2
> 
> The conflict resolution looks like:
> 
> @@ -5524,20 +5335,19 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>  int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>  {
>         MA_WR_STATE(wr_mas, mas, entry);
> -       int ret = 0;
> -       int request;
> 
>         mas_wr_prealloc_setup(&wr_mas);
>         mas->store_type = mas_wr_store_type(&wr_mas);
> -       request = mas_prealloc_calc(&wr_mas, entry);
> -       if (!request)
> +       mas_prealloc_calc(&wr_mas, entry);
> +       if (!mas->node_request)
>                 goto set_flag;
> 
>         mas->mas_flags &= ~MA_STATE_PREALLOC;
> -       mas_node_count_gfp(mas, request, gfp);
> +       mas_alloc_nodes(mas, gfp);
>         if (mas_is_err(mas)) {
> -               mas_set_alloc_req(mas, 0);
> -               ret = xa_err(mas->node);
> +               int ret = xa_err(mas->node);
> +
> +               mas->node_request = 0;
>                 mas_destroy(mas);
>                 mas_reset(mas);
>                 return ret;
> @@ -5545,7 +5355,7 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> 
>  set_flag:
>         mas->mas_flags |= MA_STATE_PREALLOC;
> -       return ret;
> +       return 0;
>  }
>  EXPORT_SYMBOL_GPL(mas_preallocate);
> 
> 
> 
> When I try to boot this kernel, I see kernel panic
> with rcu_free_sheaf() doing recursion into __kmem_cache_free_bulk()
> 
> Stack trace:
> 
> [    1.583673] Oops: stack guard page: 0000 [#1] SMP NOPTI
> [    1.583676] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.17.0-smp-sheaves2 #1 NONE
> [    1.583679] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
> [    1.583684] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
> [    1.583685] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
> [    1.583687] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
> [    1.583687] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
> [    1.583688] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
> [    1.583689] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
> [    1.583690] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
> [    1.583690] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
> [    1.583691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.583692] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
> [    1.583692] PKRU: 55555554
> [    1.583693] Call Trace:
> [    1.583694]  <IRQ>
> [    1.583696]  __kmem_cache_free_bulk+0x2c7/0x540

[..]

> [    1.583759]  __kmem_cache_free_bulk+0x2c7/0x540

Hi Sudarsan, thanks for the report.

I'm not really sure how __kmem_cache_free_bulk() can call itself.
There's no recursion of __kmem_cache_free_bulk() in the code.

As v6.17-rc1 is known to cause a few surprising bugs, could you please
rebase on top of mm-hotfixes-unstable and check if it still reproduces?

> [    1.583761]  ? update_group_capacity+0xad/0x1f0
> [    1.583763]  ? sched_balance_rq+0x4f6/0x1e80
> [    1.583765]  __kmem_cache_free_bulk+0x2c7/0x540
> [    1.583767]  ? update_irq_load_avg+0x35/0x480
> [    1.583768]  ? __pfx_rcu_free_sheaf+0x10/0x10
> [    1.583769]  rcu_free_sheaf+0x86/0x110
> [    1.583771]  rcu_do_batch+0x245/0x750
> [    1.583772]  rcu_core+0x13a/0x260
> [    1.583773]  handle_softirqs+0xcb/0x270
> [    1.583775]  __irq_exit_rcu+0x48/0xf0
> [    1.583776]  sysvec_apic_timer_interrupt+0x74/0x80
> [    1.583778]  </IRQ>
> [    1.583778]  <TASK>
> [    1.583779]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [    1.583780] RIP: 0010:cpuidle_enter_state+0x101/0x290
> [    1.583781] Code: 85 f4 ff ff 49 89 c4 8b 73 04 bf ff ff ff ff e8 d5 44 d4 ff 31 ff e8 9e c7 37 ff 80 7c 24 04 00 74 05 e8 12 45 d4 ff fb 85 ed <0f> 88 ba 00 00 00 89 e9 48 6b f9 68 4c 8b 44 24 08 49 8b 54 38 30
> [    1.583782] RSP: 0018:ff40dbc4809afe80 EFLAGS: 00000202
> [    1.583782] RAX: ff2aff31ba00b000 RBX: ff2afe75614b0800 RCX: 000000005e64b52b
> [    1.583783] RDX: 000000005e73f761 RSI: 0000000000000067 RDI: 0000000000000000
> [    1.583783] RBP: 0000000000000002 R08: fffffffffffffff6 R09: 0000000000000000
> [    1.583784] R10: 0000000000000380 R11: ffffffff908c38d0 R12: 000000005e64b535
> [    1.583784] R13: 000000005e5580da R14: ffffffff92890b10 R15: 0000000000000002
> [    1.583784]  ? __pfx_read_tsc+0x10/0x10
> [    1.583787]  cpuidle_enter+0x2c/0x40
> [    1.583788]  do_idle+0x1a7/0x240
> [    1.583790]  cpu_startup_entry+0x2a/0x30
> [    1.583791]  start_secondary+0x95/0xa0
> [    1.583794]  common_startup_64+0x13e/0x140
> [    1.583796]  </TASK>
> [    1.583796] Modules linked in:
> [    1.583798] ---[ end trace 0000000000000000 ]---
> [    1.583798] RIP: 0010:__kmem_cache_free_bulk+0x57/0x540
> [    1.583800] Code: 48 85 f6 0f 84 b8 04 00 00 49 89 d6 49 89 ff 48 85 ff 0f 84 fe 03 00 00 49 83 7f 08 00 0f 84 f3 03 00 00 0f 1f 44 00 00 31 c0 <48> 89 44 24 18 65 8b 05 6d 26 dc 02 89 44 24 2c 31 ff 89 f8 c7 44
> [    1.583800] RSP: 0018:ff40dbc49b048fc0 EFLAGS: 00010246
> [    1.583801] RAX: 0000000000000000 RBX: 0000000000000012 RCX: ffffffff939e8640
> [    1.583801] RDX: ff2afe75213e6c90 RSI: 0000000000000012 RDI: ff2afe750004ad00
> [    1.583801] RBP: ff40dbc49b049130 R08: ff2afe75368c2500 R09: ff2afe75368c3b00
> [    1.583802] R10: ff2afe75368c2500 R11: ff2afe75368c3b00 R12: ff2aff31ba00b000
> [    1.583802] R13: ffffffff939e8640 R14: ff2afe75213e6c90 R15: ff2afe750004ad00
> [    1.583802] FS:  0000000000000000(0000) GS:ff2aff31ba00b000(0000) knlGS:0000000000000000
> [    1.583803] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    1.583803] CR2: ff40dbc49b048fb8 CR3: 0000000017c3e001 CR4: 0000000000771ef0
> [    1.583803] PKRU: 55555554
> [    1.583804] Kernel panic - not syncing: Fatal exception in interrupt
> [    1.584659] Kernel Offset: 0xf600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> 
>
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Sudarsan Mahendran 3 weeks, 2 days ago
Hi Vlastimil,

I ported this patch series on top of v6.17 and ran
some benchmarks: will-it-scale, hackbench, redis,
unixbench and kernbench. I ran the benchmarks
on Intel Granite Rapids (480 cores), AMD Turin (512 cores)
and ARM (80 cores).

Summary of the results:

- Significant change (meaning >10% difference
  between base and experiment) on will-it-scale
  tests in AMD.
- No significant change on other benchmarks ran.

Summary of AMD will-it-scale test changes:

Number of runs : 15
Direction      : + is good

|            | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV     |
|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|
| brk1_8_processes
| BASE       | 7,667,220  | 7,705,767  | 7,682,782  | 7,676,211  | 12,733     |
| TEST       | 9,477,395  | 10,053,058 | 9,878,753  | 9,959,360  | 182,014    |
| %          | +23.61%    | +30.46%    | +28.58%    | +29.74%    | +1,329.46% |
| brk1_8_threads
| BASE       | 1,838,468  | 1,890,873  | 1,864,473  | 1,864,875  | 18,406     |
| TEST       | 1,464,917  | 1,668,207  | 1,608,345  | 1,654,578  | 70,558     |
| %          | -20.32%    | -11.78%    | -13.74%    | -11.28%    | +283.34%   |
| brk1_128_processes
| BASE       | 65,018,211 | 65,603,366 | 65,285,348 | 65,103,197 | 232,885    |
| TEST       | 51,287,801 | 52,161,228 | 51,647,807 | 51,509,505 | 277,326    |
| %          | -21.12%    | -20.49%    | -20.89%    | -20.88%    | +19.08%    |
| brk1_256_processes
| BASE       | 15,186,881 | 15,239,120 | 15,210,265 | 15,205,809 | 15,850     |
| TEST       | 20,478,924 | 23,936,204 | 22,754,698 | 22,771,320 | 1,255,974  |
| %          | +34.85%    | +57.07%    | +49.60%    | +49.75%    | +7,823.72% |
| brk1_384_processes
| BASE       | 11,587,076 | 11,851,775 | 11,765,869 | 11,806,007 | 73,736     |
| TEST       | 25,464,757 | 29,176,818 | 26,695,946 | 26,563,116 | 1,012,563  |
| %          | +119.77%   | +146.18%   | +126.89%   | +125.00%   | +1,273.23% |
| brk1_512_processes
| BASE       | 14,410,293 | 14,891,082 | 14,775,209 | 14,793,142 | 114,384    |
| TEST       | 22,918,195 | 29,648,177 | 25,204,321 | 25,128,471 | 1,604,037  |
| %          | +59.04%    | +99.10%    | +70.59%    | +69.87%    | +1,302.32% |
| mmap1_8_processes
| BASE       | 3,164,411  | 3,170,585  | 3,167,590  | 3,167,692  | 2,436      |
| TEST       | 3,516,242  | 3,756,209  | 3,684,585  | 3,698,731  | 68,659     |
| %          | +11.12%    | +18.47%    | +16.32%    | +16.76%    | +2,718.28% |
| mmap1_8_threads
| BASE       | 627,817    | 632,702    | 630,554    | 629,281    | 1,764      |
| TEST       | 541,202    | 554,097    | 549,104    | 549,896    | 4,257      |
| %          | -13.80%    | -12.42%    | -12.92%    | -12.62%    | +141.34%   |
| mmap1_128_processes
| BASE       | 30,303,429 | 30,736,686 | 30,466,107 | 30,343,821 | 174,985    |
| TEST       | 9,749,426  | 9,893,331  | 9,823,701  | 9,857,157  | 52,125     |
| %          | -67.83%    | -67.81%    | -67.76%    | -67.52%    | -70.21%    |
| mmap1_256_processes
| BASE       | 7,496,765  | 7,546,703  | 7,528,379  | 7,543,246  | 21,465     |
| TEST       | 10,868,119 | 16,947,857 | 12,695,418 | 11,608,083 | 2,157,787  |
| %          | +44.97%    | +124.57%   | +68.63%    | +53.89%    | +9,952.34% |
| mmap1_384_processes
| BASE       | 5,629,206  | 5,856,927  | 5,758,347  | 5,733,892  | 85,930     |
| TEST       | 12,053,514 | 13,635,555 | 12,966,975 | 13,283,450 | 606,325    |
| %          | +114.12%   | +135.49%   | +112.89%   | +104.26%   | +2,855.57% |
| mmap1_512_processes
| BASE       | 6,959,199  | 6,996,383  | 6,975,912  | 6,974,353  | 15,446     |
| TEST       | 10,197,814 | 12,029,690 | 11,458,180 | 11,381,726 | 534,690    |
| %          | +46.54%    | +71.94%    | +64.25%    | +63.19%    | +3,361.67% |
| tlb_flush2_384_threads
| BASE       | 2,953,477  | 3,021,464  | 3,003,512  | 3,014,264  | 25,525     |
| TEST       | 2,231,417  | 2,526,876  | 2,408,368  | 2,411,121  | 115,773    |
| %          | -24.45%    | -16.37%    | -19.81%    | -20.01%    | +353.55%   |
| tlb_flush2_512_threads
| BASE       | 2,499,486  | 2,542,966  | 2,520,278  | 2,530,049  | 17,161     |
| TEST       | 1,707,641  | 1,714,524  | 1,708,951  | 1,707,713  | 1,877      |
| %          | -31.68%    | -32.58%    | -32.19%    | -32.50%    | -89.06%    |
| mmap2_128_processes
| BASE       | 29,754,984 | 30,313,146 | 30,010,106 | 29,897,731 | 218,812    |
| TEST       | 9,688,640  | 9,750,688  | 9,710,137  | 9,696,830  | 23,428     |
| %          | -67.44%    | -67.83%    | -67.64%    | -67.57%    | -89.29%    |
| mmap2_256_processes
| BASE       | 7,483,929  | 7,532,461  | 7,491,876  | 7,489,398  | 11,134     |
| TEST       | 11,580,023 | 16,508,551 | 15,337,145 | 15,943,608 | 1,489,489  |
| %          | +54.73%    | +119.17%   | +104.72%   | +112.88%   | +13,276.75%|
| mmap2_384_processes
| BASE       | 5,725,503  | 5,826,364  | 5,763,341  | 5,765,247  | 29,674     |
| TEST       | 11,682,353 | 13,720,566 | 12,269,665 | 11,776,228 | 877,060    |
| %          | +104.04%   | +135.49%   | +112.89%   | +104.26%   | +2,855.57% |
| mmap2_512_processes
| BASE       | 6,959,199  | 6,996,383  | 6,975,912  | 6,974,353  | 15,446     |
| TEST       | 10,197,814 | 12,029,690 | 11,458,180 | 11,381,726 | 534,690    |
| %          | +46.54%    | +71.94%    | +64.25%    | +63.19%    | +3,361.67% |
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Jan Engelhardt 2 weeks, 6 days ago
On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
>
>Summary of the results:
>
>- Significant change (meaning >10% difference
>  between base and experiment) on will-it-scale
>  tests in AMD.
>
>Summary of AMD will-it-scale test changes:
>
>Number of runs : 15
>Direction      : + is good

If STDDEV grows more than mean, there is more jitter,
which is not "good".

>|            | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV     |
>|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|
>| brk1_8_processes
>| BASE       | 7,667,220  | 7,705,767  | 7,682,782  | 7,676,211  | 12,733     |
>| TEST       | 9,477,395  | 10,053,058 | 9,878,753  | 9,959,360  | 182,014    |
>| %          | +23.61%    | +30.46%    | +28.58%    | +29.74%    | +1,329.46% |
>
>| mmap2_256_processes
>| BASE       | 7,483,929  | 7,532,461  | 7,491,876  | 7,489,398  | 11,134     |
>| TEST       | 11,580,023 | 16,508,551 | 15,337,145 | 15,943,608 | 1,489,489  |
>| %          | +54.73%    | +119.17%   | +104.72%   | +112.88%   | +13,276.75%|
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Paul E. McKenney 2 weeks, 6 days ago
On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> 
> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> >
> >Summary of the results:
> >
> >- Significant change (meaning >10% difference
> >  between base and experiment) on will-it-scale
> >  tests in AMD.
> >
> >Summary of AMD will-it-scale test changes:
> >
> >Number of runs : 15
> >Direction      : + is good
> 
> If STDDEV grows more than mean, there is more jitter,
> which is not "good".

This is true.  On the other hand, the mean grew way more in absolute
terms than did STDDEV.  So might this be a reasonable tradeoff?

Of course, if adjustments can be made to keep the increase in mean while
keeping STDDEV low, that would of course be even better.

							Thanx, Paul

> >|            | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV     |
> >|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|
> >| brk1_8_processes
> >| BASE       | 7,667,220  | 7,705,767  | 7,682,782  | 7,676,211  | 12,733     |
> >| TEST       | 9,477,395  | 10,053,058 | 9,878,753  | 9,959,360  | 182,014    |
> >| %          | +23.61%    | +30.46%    | +28.58%    | +29.74%    | +1,329.46% |
> >
> >| mmap2_256_processes
> >| BASE       | 7,483,929  | 7,532,461  | 7,491,876  | 7,489,398  | 11,134     |
> >| TEST       | 11,580,023 | 16,508,551 | 15,337,145 | 15,943,608 | 1,489,489  |
> >| %          | +54.73%    | +119.17%   | +104.72%   | +112.88%   | +13,276.75%|
>
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Vlastimil Babka 2 weeks, 6 days ago
On 9/15/25 14:13, Paul E. McKenney wrote:
> On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
>> 
>> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
>> >
>> >Summary of the results:

In any case, thanks a lot for the results!

>> >- Significant change (meaning >10% difference
>> >  between base and experiment) on will-it-scale
>> >  tests in AMD.
>> >
>> >Summary of AMD will-it-scale test changes:
>> >
>> >Number of runs : 15
>> >Direction      : + is good
>> 
>> If STDDEV grows more than mean, there is more jitter,
>> which is not "good".
> 
> This is true.  On the other hand, the mean grew way more in absolute
> terms than did STDDEV.  So might this be a reasonable tradeoff?

Also I'd point out that MIN of TEST is better than MAX of BASE, which means
there's always an improvement for this config. So jitter here means it's
changing between better and more better :) and not between worse and (more)
better.

The annoying part of course is that for other configs it's consistently the
opposite.

> Of course, if adjustments can be made to keep the increase in mean while
> keeping STDDEV low, that would of course be even better.
> 
> 							Thanx, Paul
> 
>> >|            | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV     |
>> >|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|
>> >| brk1_8_processes
>> >| BASE       | 7,667,220  | 7,705,767  | 7,682,782  | 7,676,211  | 12,733     |
>> >| TEST       | 9,477,395  | 10,053,058 | 9,878,753  | 9,959,360  | 182,014    |
>> >| %          | +23.61%    | +30.46%    | +28.58%    | +29.74%    | +1,329.46% |
>> >
>> >| mmap2_256_processes
>> >| BASE       | 7,483,929  | 7,532,461  | 7,491,876  | 7,489,398  | 11,134     |
>> >| TEST       | 11,580,023 | 16,508,551 | 15,337,145 | 15,943,608 | 1,489,489  |
>> >| %          | +54.73%    | +119.17%   | +104.72%   | +112.88%   | +13,276.75%|
>>
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Suren Baghdasaryan 2 weeks, 5 days ago
On Mon, Sep 15, 2025 at 8:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 9/15/25 14:13, Paul E. McKenney wrote:
> > On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> >>
> >> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> >> >
> >> >Summary of the results:
>
> In any case, thanks a lot for the results!
>
> >> >- Significant change (meaning >10% difference
> >> >  between base and experiment) on will-it-scale
> >> >  tests in AMD.
> >> >
> >> >Summary of AMD will-it-scale test changes:
> >> >
> >> >Number of runs : 15
> >> >Direction      : + is good
> >>
> >> If STDDEV grows more than mean, there is more jitter,
> >> which is not "good".
> >
> > This is true.  On the other hand, the mean grew way more in absolute
> > terms than did STDDEV.  So might this be a reasonable tradeoff?
>
> Also I'd point out that MIN of TEST is better than MAX of BASE, which means
> there's always an improvement for this config. So jitter here means it's
> changing between better and more better :) and not between worse and (more)
> better.
>
> The annoying part of course is that for other configs it's consistently the
> opposite.

Hi Vlastimil,
I ran my mmap stress test that runs 20000 cycles of mmapping 50 VMAs,
faulting them in then unmapping and timing only mmap and munmap calls.
This is not a realistic scenario but works well for A/B comparison.
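
The test is shaped roughly like this (an illustrative sketch only, not
the actual test; now() and touch_pages() are hypothetical helpers, and
LEN is an arbitrary mapping size):

	/* time only the mmap() and munmap() calls */
	for (cycle = 0; cycle < 20000; cycle++) {
		t = now();
		for (i = 0; i < 50; i++)
			ptr[i] = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		mmap_time += now() - t;

		touch_pages(ptr, 50, LEN);	/* fault the VMAs in */

		t = now();
		for (i = 0; i < 50; i++)
			munmap(ptr[i], LEN);
		munmap_time += now() - t;
	}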

The numbers are below with sheaves showing a clear improvement:

Baseline
            avg             stdev
mmap        2.621073        0.2525161631
munmap      2.292965        0.008831973052
total       4.914038        0.2572620923

Sheaves
            avg            stdev           avg_diff        stdev_diff
mmap        1.561220667    0.07748897037   -40.44%        -69.31%
munmap      2.042071       0.03603083448   -10.94%        307.96%
total       3.603291667    0.113209047     -26.67%        -55.99%

Stdev for munmap went high but I see that there was only one run that
was very different from others, so that might have been just a noisy
run.

One thing I noticed is that with my stress test mmap/munmap'ing in a
loop, we get lots of in-flight freed-by-RCU sheaves before the grace
period arrives and they get freed in bulk. Note that Android enables
the lazy RCU config, so that affects the grace period and makes it
longer than normal. This results in sheaves being freed in bulk, and
when that happens the barn quickly gets full (we only have 10
(MAX_FULL_SHEAVES) free slots) and the rest of the sheaves being freed
are destroyed instead of being reused.

I tried two modifications:
1. Use call_rcu_hurry() instead of call_rcu() when freeing the
sheaves (see the sketch below). This should remove the effects of lazy RCU;
2. Keep a running count of in-flight RCU-freed sheaves and once it
reaches the number of free slots for full sheaves in the barn, I
schedule an rcu_barrier() to free all these in-flight sheaves. Note
that I added an additional condition to skip this RCU flush if the
number of free slots for full sheaves is less than MAX_FULL_SHEAVES/2.
That should prevent flushing when it would free only a small number of
sheaves.
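
For reference, modification 1 is essentially a one-line change in the
sheaf freeing path (a hedged sketch against this series; rcu_free_sheaf()
is the callback seen in the earlier stack trace, while the rcu_head field
name is assumed):

-	call_rcu(&sheaf->rcu_head, rcu_free_sheaf);
+	call_rcu_hurry(&sheaf->rcu_head, rcu_free_sheaf);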

With these modifications the numbers get even better:

Sheaves with call_rcu_hurry
            avg                            avg_diff (vs Baseline)
mmap        1.279308                       -51.19%
munmap      1.983921                       -13.48%
total       3.263228                       -33.59%

Sheaves with rcu_barrier
            avg                            avg_diff (vs Baseline)
mmap        1.210455                       -53.82%
munmap      1.963739                       -14.36%
total       3.174194                       -35.41%

I didn't capture stdev because I did not run as many times as the
first two configurations.

Again, the tight loop in my test is not representative of a real
workloads and the numbers are definitely affected by the use of lazy
RCU mode in Android. While this information can be used for later
optimizations, I don't think these findings should block current
deployment of the sheaves.
Thanks,
Suren.


>
> > Of course, if adjustments can be made to keep the increase in mean while
> > keeping STDDEV low, that would of course be even better.
> >
> >                                                       Thanx, Paul
> >
> >> >|            | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV     |
> >> >|:-----------|:-----------|:-----------|:-----------|:-----------|:-----------|
> >> >| brk1_8_processes
> >> >| BASE       | 7,667,220  | 7,705,767  | 7,682,782  | 7,676,211  | 12,733     |
> >> >| TEST       | 9,477,395  | 10,053,058 | 9,878,753  | 9,959,360  | 182,014    |
> >> >| %          | +23.61%    | +30.46%    | +28.58%    | +29.74%    | +1,329.46% |
> >> >
> >> >| mmap2_256_processes
> >> >| BASE       | 7,483,929  | 7,532,461  | 7,491,876  | 7,489,398  | 11,134     |
> >> >| TEST       | 11,580,023 | 16,508,551 | 15,337,145 | 15,943,608 | 1,489,489  |
> >> >| %          | +54.73%    | +119.17%   | +104.72%   | +112.88%   | +13,276.75%|
> >>
>
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Uladzislau Rezki 2 weeks, 5 days ago
On Tue, Sep 16, 2025 at 10:09:18AM -0700, Suren Baghdasaryan wrote:
> On Mon, Sep 15, 2025 at 8:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 9/15/25 14:13, Paul E. McKenney wrote:
> > > On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> > >>
> > >> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> > >> >
> > >> >Summary of the results:
> >
> > In any case, thanks a lot for the results!
> >
> > >> >- Significant change (meaning >10% difference
> > >> >  between base and experiment) on will-it-scale
> > >> >  tests in AMD.
> > >> >
> > >> >Summary of AMD will-it-scale test changes:
> > >> >
> > >> >Number of runs : 15
> > >> >Direction      : + is good
> > >>
> > >> If STDDEV grows more than mean, there is more jitter,
> > >> which is not "good".
> > >
> > > This is true.  On the other hand, the mean grew way more in absolute
> > > terms than did STDDEV.  So might this be a reasonable tradeoff?
> >
> > Also I'd point out that MIN of TEST is better than MAX of BASE, which means
> > there's always an improvement for this config. So jitter here means it's
> > changing between better and more better :) and not between worse and (more)
> > better.
> >
> > The annoying part of course is that for other configs it's consistently the
> > opposite.
> 
> Hi Vlastimil,
> I ran my mmap stress test that runs 20000 cycles of mmapping 50 VMAs,
> faulting them in then unmapping and timing only mmap and munmap calls.
> This is not a realistic scenario but works well for A/B comparison.
> 
> The numbers are below with sheaves showing a clear improvement:
> 
> Baseline
>             avg             stdev
> mmap        2.621073        0.2525161631
> munmap      2.292965        0.008831973052
> total       4.914038        0.2572620923
> 
> Sheaves
>             avg            stdev           avg_diff        stdev_diff
> mmap        1.561220667    0.07748897037   -40.44%        -69.31%
> munmap      2.042071       0.03603083448   -10.94%        307.96%
> total       3.603291667    0.113209047     -26.67%        -55.99%
> 
Could you run your test with dropping below patch?

[PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations

mmap()/munmap(), i assume it is a duration time in average, is the time
in microseconds?

Thank you. 

--
Uladzislau Rezki
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Suren Baghdasaryan 2 weeks, 4 days ago
On Tue, Sep 16, 2025 at 10:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> On Tue, Sep 16, 2025 at 10:09:18AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Sep 15, 2025 at 8:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > >
> > > On 9/15/25 14:13, Paul E. McKenney wrote:
> > > > On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> > > >>
> > > >> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> > > >> >
> > > >> >Summary of the results:
> > >
> > > In any case, thanks a lot for the results!
> > >
> > > >> >- Significant change (meaning >10% difference
> > > >> >  between base and experiment) on will-it-scale
> > > >> >  tests in AMD.
> > > >> >
> > > >> >Summary of AMD will-it-scale test changes:
> > > >> >
> > > >> >Number of runs : 15
> > > >> >Direction      : + is good
> > > >>
> > > >> If STDDEV grows more than mean, there is more jitter,
> > > >> which is not "good".
> > > >
> > > > This is true.  On the other hand, the mean grew way more in absolute
> > > > terms than did STDDEV.  So might this be a reasonable tradeoff?
> > >
> > > Also I'd point out that MIN of TEST is better than MAX of BASE, which means
> > > there's always an improvement for this config. So jitter here means it's
> > > changing between better and more better :) and not between worse and (more)
> > > better.
> > >
> > > The annoying part of course is that for other configs it's consistently the
> > > opposite.
> >
> > Hi Vlastimil,
> > I ran my mmap stress test that runs 20000 cycles of mmapping 50 VMAs,
> > faulting them in then unmapping and timing only mmap and munmap calls.
> > This is not a realistic scenario but works well for A/B comparison.
> >
> > The numbers are below with sheaves showing a clear improvement:
> >
> > Baseline
> >             avg             stdev
> > mmap        2.621073        0.2525161631
> > munmap      2.292965        0.008831973052
> > total       4.914038        0.2572620923
> >
> > Sheaves
> >             avg            stdev           avg_diff        stdev_diff
> > mmap        1.561220667    0.07748897037   -40.44%        -69.31%
> > munmap      2.042071       0.03603083448   -10.94%        307.96%
> > total       3.603291667    0.113209047     -26.67%        -55.99%
> >
> Could you run your test with dropping below patch?

Sure, will try later today and report.

>
> [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
>
> mmap()/munmap(), i assume it is a duration time in average, is the time
> in microseconds?

Yeah, it ends up being in microseconds. The actual reported time is
the total time in seconds that all mmap/munmap in the test consumed.
With 20000 cycles of 50 mmap/munmap calls we end up with 1000000
syscalls, so the number can be considered as duration in microseconds
for a single call.

>
> Thank you.
>
> --
> Uladzislau Rezki
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Suren Baghdasaryan 2 weeks, 4 days ago
On Wed, Sep 17, 2025 at 9:14 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Sep 16, 2025 at 10:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > On Tue, Sep 16, 2025 at 10:09:18AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Sep 15, 2025 at 8:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > >
> > > > On 9/15/25 14:13, Paul E. McKenney wrote:
> > > > > On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> > > > >>
> > > > >> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> > > > >> >
> > > > >> >Summary of the results:
> > > >
> > > > In any case, thanks a lot for the results!
> > > >
> > > > >> >- Significant change (meaning >10% difference
> > > > >> >  between base and experiment) on will-it-scale
> > > > >> >  tests in AMD.
> > > > >> >
> > > > >> >Summary of AMD will-it-scale test changes:
> > > > >> >
> > > > >> >Number of runs : 15
> > > > >> >Direction      : + is good
> > > > >>
> > > > >> If STDDEV grows more than mean, there is more jitter,
> > > > >> which is not "good".
> > > > >
> > > > > This is true.  On the other hand, the mean grew way more in absolute
> > > > > terms than did STDDEV.  So might this be a reasonable tradeoff?
> > > >
> > > > Also I'd point out that MIN of TEST is better than MAX of BASE, which means
> > > > there's always an improvement for this config. So jitter here means it's
> > > > changing between better and more better :) and not between worse and (more)
> > > > better.
> > > >
> > > > The annoying part of course is that for other configs it's consistently the
> > > > opposite.
> > >
> > > Hi Vlastimil,
> > > I ran my mmap stress test that runs 20000 cycles of mmapping 50 VMAs,
> > > faulting them in then unmapping and timing only mmap and munmap calls.
> > > This is not a realistic scenario but works well for A/B comparison.
> > >
> > > The numbers are below with sheaves showing a clear improvement:
> > >
> > > Baseline
> > >             avg             stdev
> > > mmap        2.621073        0.2525161631
> > > munmap      2.292965        0.008831973052
> > > total       4.914038        0.2572620923
> > >
> > > Sheaves
> > >             avg            stdev           avg_diff        stdev_diff
> > > mmap        1.561220667    0.07748897037   -40.44%        -69.31%
> > > munmap      2.042071       0.03603083448   -10.94%        307.96%
> > > total       3.603291667    0.113209047     -26.67%        -55.99%
> > >
> > Could you run your test with dropping below patch?
>
> Sure, will try later today and report.

Sheaves with [04/23] patch reverted:

            avg             avg_diff
mmap        2.143948        -18.20%
munmap      2.343707        +2.21%
total       4.487655        -8.68%


>
> >
> > [PATCH v8 04/23] slab: add sheaf support for batching kfree_rcu() operations
> >
> > mmap()/munmap(), i assume it is a duration time in average, is the time
> > in microseconds?
>
> Yeah, it ends up being in microseconds. The actual reported time is
> the total time in seconds that all mmap/munmap in the test consumed.
> With 20000 cycles of 50 mmap/munmap calls we end up with 1000000
> syscalls, so the number can be considered as duration in microseconds
> for a single call.
>
> >
> > Thank you.
> >
> > --
> > Uladzislau Rezki
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Uladzislau Rezki 2 weeks, 3 days ago
On Wed, Sep 17, 2025 at 04:59:41PM -0700, Suren Baghdasaryan wrote:
> On Wed, Sep 17, 2025 at 9:14 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Sep 16, 2025 at 10:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 10:09:18AM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Sep 15, 2025 at 8:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > >
> > > > > On 9/15/25 14:13, Paul E. McKenney wrote:
> > > > > > On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> > > > > >>
> > > > > >> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> > > > > >> >
> > > > > >> >Summary of the results:
> > > > >
> > > > > In any case, thanks a lot for the results!
> > > > >
> > > > > >> >- Significant change (meaning >10% difference
> > > > > >> >  between base and experiment) on will-it-scale
> > > > > >> >  tests in AMD.
> > > > > >> >
> > > > > >> >Summary of AMD will-it-scale test changes:
> > > > > >> >
> > > > > >> >Number of runs : 15
> > > > > >> >Direction      : + is good
> > > > > >>
> > > > > >> If STDDEV grows more than mean, there is more jitter,
> > > > > >> which is not "good".
> > > > > >
> > > > > > This is true.  On the other hand, the mean grew way more in absolute
> > > > > > terms than did STDDEV.  So might this be a reasonable tradeoff?
> > > > >
> > > > > Also I'd point out that MIN of TEST is better than MAX of BASE, which means
> > > > > there's always an improvement for this config. So jitter here means it's
> > > > > changing between better and more better :) and not between worse and (more)
> > > > > better.
> > > > >
> > > > > The annoying part of course is that for other configs it's consistently the
> > > > > opposite.
> > > >
> > > > Hi Vlastimil,
> > > > I ran my mmap stress test that runs 20000 cycles of mmapping 50 VMAs,
> > > > faulting them in then unmapping and timing only mmap and munmap calls.
> > > > This is not a realistic scenario but works well for A/B comparison.
> > > >
> > > > The numbers are below with sheaves showing a clear improvement:
> > > >
> > > > Baseline
> > > >             avg             stdev
> > > > mmap        2.621073        0.2525161631
> > > > munmap      2.292965        0.008831973052
> > > > total       4.914038        0.2572620923
> > > >
> > > > Sheaves
> > > >             avg            stdev           avg_diff        stdev_diff
> > > > mmap        1.561220667    0.07748897037   -40.44%        -69.31%
> > > > munmap      2.042071       0.03603083448   -10.94%        307.96%
> > > > total       3.603291667    0.113209047     -26.67%        -55.99%
> > > >
> > > Could you run your test with dropping below patch?
> >
> > Sure, will try later today and report.
> 
> Sheaves with [04/23] patch reverted:
> 
>             avg             avg_diff
> mmap     2.143948        -18.20%
> munmap     2.343707        2.21%
> total     4.487655        -8.68%
> 
With offloading over sheaves the mmap/munmap is faster, i assume it is
because of same objects are reused from the sheaves after reclaim. Whereas we,
kvfree_rcu() just free them.
 
Thank you for your results.

--
Uladzislau Rezki
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Liam R. Howlett 2 weeks, 3 days ago
* Uladzislau Rezki <urezki@gmail.com> [250918 07:50]:
> On Wed, Sep 17, 2025 at 04:59:41PM -0700, Suren Baghdasaryan wrote:
> > On Wed, Sep 17, 2025 at 9:14 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Sep 16, 2025 at 10:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > On Tue, Sep 16, 2025 at 10:09:18AM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Sep 15, 2025 at 8:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > > >
> > > > > > On 9/15/25 14:13, Paul E. McKenney wrote:
> > > > > > > On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> > > > > > >>
> > > > > > >> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> > > > > > >> >
> > > > > > >> >Summary of the results:
> > > > > >
> > > > > > In any case, thanks a lot for the results!
> > > > > >
> > > > > > >> >- Significant change (meaning >10% difference
> > > > > > >> >  between base and experiment) on will-it-scale
> > > > > > >> >  tests in AMD.
> > > > > > >> >
> > > > > > >> >Summary of AMD will-it-scale test changes:
> > > > > > >> >
> > > > > > >> >Number of runs : 15
> > > > > > >> >Direction      : + is good
> > > > > > >>
> > > > > > >> If STDDEV grows more than mean, there is more jitter,
> > > > > > >> which is not "good".
> > > > > > >
> > > > > > > This is true.  On the other hand, the mean grew way more in absolute
> > > > > > > terms than did STDDEV.  So might this be a reasonable tradeoff?
> > > > > >
> > > > > > Also I'd point out that MIN of TEST is better than MAX of BASE, which means
> > > > > > there's always an improvement for this config. So jitter here means it's
> > > > > > changing between better and more better :) and not between worse and (more)
> > > > > > better.
> > > > > >
> > > > > > The annoying part of course is that for other configs it's consistently the
> > > > > > opposite.
> > > > >
> > > > > Hi Vlastimil,
> > > > > I ran my mmap stress test that runs 20000 cycles of mmapping 50 VMAs,
> > > > > faulting them in then unmapping and timing only mmap and munmap calls.
> > > > > This is not a realistic scenario but works well for A/B comparison.
> > > > >
> > > > > The numbers are below with sheaves showing a clear improvement:
> > > > >
> > > > > Baseline
> > > > >             avg             stdev
> > > > > mmap        2.621073        0.2525161631
> > > > > munmap      2.292965        0.008831973052
> > > > > total       4.914038        0.2572620923
> > > > >
> > > > > Sheaves
> > > > >             avg            stdev           avg_diff        stdev_diff
> > > > > mmap        1.561220667    0.07748897037   -40.44%        -69.31%
> > > > > munmap      2.042071       0.03603083448   -10.94%        307.96%
> > > > > total       3.603291667    0.113209047     -26.67%        -55.99%
> > > > >
> > > > Could you run your test with dropping below patch?
> > >
> > > Sure, will try later today and report.
> > 
> > Sheaves with [04/23] patch reverted:
> > 
> >             avg             avg_diff
> > mmap     2.143948        -18.20%
> > munmap     2.343707        2.21%
> > total     4.487655        -8.68%
> > 
> With offloading over sheaves the mmap/munmap is faster, i assume it is
> because of same objects are reused from the sheaves after reclaim. Whereas we,
> kvfree_rcu() just free them.

Sorry, I am having trouble following where you think the speed up is
coming from.

Can you clarify what you mean by offloading and reclaim in this context?

Thanks,
Liam
Re: Benchmarking [PATCH v5 00/14] SLUB percpu sheaves
Posted by Uladzislau Rezki 2 weeks, 2 days ago
On Thu, Sep 18, 2025 at 11:29:14AM -0400, Liam R. Howlett wrote:
> * Uladzislau Rezki <urezki@gmail.com> [250918 07:50]:
> > On Wed, Sep 17, 2025 at 04:59:41PM -0700, Suren Baghdasaryan wrote:
> > > On Wed, Sep 17, 2025 at 9:14 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Tue, Sep 16, 2025 at 10:19 PM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > On Tue, Sep 16, 2025 at 10:09:18AM -0700, Suren Baghdasaryan wrote:
> > > > > > On Mon, Sep 15, 2025 at 8:22 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > > > > > >
> > > > > > > On 9/15/25 14:13, Paul E. McKenney wrote:
> > > > > > > > On Mon, Sep 15, 2025 at 09:51:25AM +0200, Jan Engelhardt wrote:
> > > > > > > >>
> > > > > > > >> On Saturday 2025-09-13 02:09, Sudarsan Mahendran wrote:
> > > > > > > >> >
> > > > > > > >> >Summary of the results:
> > > > > > >
> > > > > > > In any case, thanks a lot for the results!
> > > > > > >
> > > > > > > >> >- Significant change (meaning >10% difference
> > > > > > > >> >  between base and experiment) on will-it-scale
> > > > > > > >> >  tests in AMD.
> > > > > > > >> >
> > > > > > > >> >Summary of AMD will-it-scale test changes:
> > > > > > > >> >
> > > > > > > >> >Number of runs : 15
> > > > > > > >> >Direction      : + is good
> > > > > > > >>
> > > > > > > >> If STDDEV grows more than mean, there is more jitter,
> > > > > > > >> which is not "good".
> > > > > > > >
> > > > > > > > This is true.  On the other hand, the mean grew way more in absolute
> > > > > > > > terms than did STDDEV.  So might this be a reasonable tradeoff?
> > > > > > >
> > > > > > > Also I'd point out that MIN of TEST is better than MAX of BASE, which means
> > > > > > > there's always an improvement for this config. So jitter here means it's
> > > > > > > changing between better and more better :) and not between worse and (more)
> > > > > > > better.
> > > > > > >
> > > > > > > The annoying part of course is that for other configs it's consistently the
> > > > > > > opposite.
> > > > > >
> > > > > > Hi Vlastimil,
> > > > > > I ran my mmap stress test that runs 20000 cycles of mmapping 50 VMAs,
> > > > > > faulting them in then unmapping and timing only mmap and munmap calls.
> > > > > > This is not a realistic scenario but works well for A/B comparison.
> > > > > >
> > > > > > The numbers are below with sheaves showing a clear improvement:
> > > > > >
> > > > > > Baseline
> > > > > >             avg             stdev
> > > > > > mmap        2.621073        0.2525161631
> > > > > > munmap      2.292965        0.008831973052
> > > > > > total       4.914038        0.2572620923
> > > > > >
> > > > > > Sheaves
> > > > > >             avg            stdev           avg_diff        stdev_diff
> > > > > > mmap        1.561220667    0.07748897037   -40.44%        -69.31%
> > > > > > munmap      2.042071       0.03603083448   -10.94%        307.96%
> > > > > > total       3.603291667    0.113209047     -26.67%        -55.99%
> > > > > >
> > > > > Could you run your test with dropping below patch?
> > > >
> > > > Sure, will try later today and report.
> > > 
> > > Sheaves with [04/23] patch reverted:
> > > 
> > >             avg             avg_diff
> > > mmap     2.143948        -18.20%
> > > munmap     2.343707        2.21%
> > > total     4.487655        -8.68%
> > > 
> > With offloading over sheaves the mmap/munmap is faster, i assume it is
> > because of same objects are reused from the sheaves after reclaim. Whereas we,
> > kvfree_rcu() just free them.
> 
> Sorry, I am having trouble following where you think the speed up is
> coming from.
> 
> Can you clarify what you mean by offloading and reclaim in this context?
> 
[1] <Sheaves series>
             avg            stdev           avg_diff        stdev_diff
 mmap        1.561220667    0.07748897037   -40.44%        -69.31%
 munmap      2.042071       0.03603083448   -10.94%        307.96%
 total       3.603291667    0.113209047     -26.67%        -55.99%
[1] <Sheaves series>

[2] <Sheaves series but with [04/23] patch reverted>
             avg             avg_diff
 mmap     2.143948        -18.20%
 munmap     2.343707        2.21%
 total     4.487655        -8.68%
[2] <Sheaves series but with [04/23] patch reverted>

I meant those two data results. It is a comparison of freeing over or
to "sheaves" versus without it in the kvfree_rcu() path.

--
Uladzislau Rezki