Hi,
This is the v2 RFC to add an opt-in percpu array-based caching layer to
SLUB. The name "sheaf" was invented by Matthew so we don't call it
magazine like the original Bonwick paper. The per-NUMA-node cache of
sheaves is thus called "barn".
This may seem similar to the arrays in SLAB, but the main differences
are:
- opt-in, not used for every cache
- does not distinguish NUMA locality, thus no "alien" arrays that would
need periodic flushing
- improves kfree_rcu() handling
- API for obtaining a preallocated sheaf that can be used for guaranteed
and efficient allocations in a restricted context, when the upper
bound for needed objects is known but rarely reached
The motivation comes mainly from the ongoing work related to VMA
scalability and the related maple tree operations. This is why maple
tree nodes are sheaf-enabled in the RFC, but it's not a full conversion
that would take benefits of the improved preallocation API. The VMA part
is currently left out as it's expected that Suren will land the VMA
TYPESAFE_BY_RCU conversion [3] soon and there would be conflict with that.
With both series applied it means just adding a line to kmem_cache_args
in proc_caches_init().
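For illustration, the opt-in could look roughly like the sketch below (kernel code, not runnable standalone; the sheaf_capacity field is from this series, but the capacity value and the exact surrounding code in proc_caches_init() are illustrative assumptions):

```c
/* sketch: enabling percpu sheaves for the vm_area_struct cache */
static void __init proc_caches_init(void)
{
	struct kmem_cache_args args = {
		.sheaf_capacity = 32,	/* illustrative value, not tuned */
	};

	vm_area_cachep = kmem_cache_create("vm_area_struct",
			sizeof(struct vm_area_struct), &args,
			SLAB_PANIC|SLAB_ACCOUNT);
	/* ... other caches ... */
}
```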
Some performance benefits were measured by Suren and Liam in previous
versions. I hope to have those numbers posted publicly as both this work
and the VMA and maple tree changes stabilize.
A sheaf-enabled cache has the following expected advantages:
- Cheaper fast paths. For allocations, instead of a local double cmpxchg,
after Patch 5 it's preempt_disable() and no atomic operations. The same
goes for freeing, which is normally a local double cmpxchg only for
short-term allocations (so the same slab is still active on the same cpu
when freeing the object) and a more costly locked double cmpxchg otherwise.
The downside is the lack of NUMA locality guarantees for the allocated
objects.
- kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
separate percpu sheaf and only submit the whole sheaf to call_rcu()
when full. After the grace period, the sheaf can be used for
allocations, which is more efficient than freeing and reallocating
individual slab objects (even with the batching done by kfree_rcu()
implementation itself). In case only some cpus are allowed to handle rcu
callbacks, the sheaf can still be made available to other cpus on the
same node via the shared barn. The maple_node cache uses kfree_rcu() and
thus can benefit from this.
- Preallocation support. A prefilled sheaf can be privately borrowed for
a short-term operation that is not allowed to block in the middle and
may need to allocate some objects. If an upper bound (worst case) for
the number of allocations is known, but far fewer allocations are
actually needed on average, borrowing and returning a sheaf is much more
efficient than a bulk allocation for the worst case followed by a bulk
free of the many unused objects. Maple tree write operations should
benefit from this.
Patch 1 implements the basic sheaf functionality, using
local_lock_irqsave() for percpu sheaf locking.
Patch 2 adds the kfree_rcu() support.
Patch 3 is copied from the series "bpf, mm: Introduce try_alloc_pages()"
[2] to introduce a variant of local_lock that has a trylock operation.
Patch 4 adds a variant of the trylock without _irqsave. Patch 5 converts
percpu sheaves locking to the new variant of the lock.
Patch 6 implements borrowing prefilled sheaves, with maple tree being the
anticipated user.
Patch 7 seeks to reduce barn spinlock contention. It is kept separate for
easier evaluation.
Patches 8 and 9 by Liam add testing stubs that maple tree will use in
its userspace tests.
Patch 10 enables sheaves for the maple tree node cache, but does not
take advantage of prefilling yet.
(RFC) LIMITATIONS:
- with slub_debug enabled, objects in sheaves are considered allocated
so allocation/free stacktraces may become imprecise and checking of
e.g. redzone violations may be delayed
GIT TREES:
this series: https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v2
To avoid conflicts, the series requires (and the branch above is based
on) the kfree_rcu() code refactoring scheduled for 6.15:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-6.15/kfree_rcu_tiny
To facilitate testing/benchmarking, there's also a branch with Liam's
maple tree changes from [4] adapted to the current code:
https://git.kernel.org/vbabka/l/slub-percpu-sheaves-v2-maple
There are also two optimization patches for sheaves by Liam, included for
evaluation, as I suspect they might not be a universal win.
Vlastimil
[1] https://lore.kernel.org/all/20241111205506.3404479-1-surenb@google.com/
[2] https://lore.kernel.org/all/20250213033556.9534-4-alexei.starovoitov@gmail.com/
[3] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
[4] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/slub-percpu-sheaves-v2
---
Changes in v2:
- Removed kfree_rcu() destructors support as VMAs will not need it
anymore after [3] is merged.
- Changed to localtry_lock_t borrowed from [2] instead of an own
implementation of the same idea.
- Many fixes and improvements thanks to Liam's adoption for maple tree
nodes.
- Userspace testing stubs by Liam.
- Reduced limitations/todos - hooking to kfree_rcu() is complete,
prefilled sheaves can exceed cache's sheaf_capacity.
- Link to v1: https://lore.kernel.org/r/20241112-slub-percpu-caches-v1-0-ddc0bdc27e05@suse.cz
---
Liam R. Howlett (2):
tools: Add testing support for changes to rcu and slab for sheaves
tools: Add sheafs support to testing infrastructure
Sebastian Andrzej Siewior (1):
locking/local_lock: Introduce localtry_lock_t
Vlastimil Babka (7):
slab: add opt-in caching layer of percpu sheaves
slab: add sheaf support for batching kfree_rcu() operations
locking/local_lock: add localtry_trylock()
slab: switch percpu sheaves locking to localtry_lock
slab: sheaf prefilling for guaranteed allocations
slab: determine barn status racily outside of lock
maple_tree: use percpu sheaves for maple_node_cache
include/linux/local_lock.h | 70 ++
include/linux/local_lock_internal.h | 146 ++++
include/linux/slab.h | 50 ++
lib/maple_tree.c | 11 +-
mm/slab.h | 4 +
mm/slab_common.c | 26 +-
mm/slub.c | 1403 +++++++++++++++++++++++++++++++--
tools/include/linux/slab.h | 65 +-
tools/testing/shared/linux.c | 108 ++-
tools/testing/shared/linux/rcupdate.h | 22 +
10 files changed, 1840 insertions(+), 65 deletions(-)
---
base-commit: 379487e17ca406b47392e7ab6cf35d1c3bacb371
change-id: 20231128-slub-percpu-caches-9441892011d7
prerequisite-message-id: 20250203-slub-tiny-kfree_rcu-v1-0-d4428bf9a8a1@suse.cz
prerequisite-patch-id: 1a4af92b5eb1b8bfc86bac8d7fc1ef0963e7d9d6
prerequisite-patch-id: f24a39c38103b7e09fbf2e6f84e6108499ab7980
prerequisite-patch-id: 23e90b23482f4775c95295821dd779ba4e3712e9
prerequisite-patch-id: 5c53a619477acdce07071abec0f40e79501ea40b
Best regards,
--
Vlastimil Babka <vbabka@suse.cz>
On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> after Patch 5 it's preempt_disable() and no atomic operations. Same for
> freeing, which is normally a local double cmpxchg only for a short
> term allocations (so the same slab is still active on the same cpu when
> freeing the object) and a more costly locked double cmpxchg otherwise.
> The downside is the lack of NUMA locality guarantees for the allocated
> objects.

Is that really cheaper than a local non locked double cmpxchg?
Especially if you now have to use pushf/popf...

> - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> separate percpu sheaf and only submit the whole sheaf to call_rcu()
> when full. After the grace period, the sheaf can be used for
> allocations, which is more efficient than freeing and reallocating
> individual slab objects (even with the batching done by kfree_rcu()
> implementation itself). In case only some cpus are allowed to handle rcu
> callbacks, the sheaf can still be made available to other cpus on the
> same node via the shared barn. The maple_node cache uses kfree_rcu() and
> thus can benefit from this.

Have you looked at fs/bcachefs/rcu_pending.c?
On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
<kent.overstreet@linux.dev> wrote:
>
> On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > freeing, which is normally a local double cmpxchg only for a short
> > term allocations (so the same slab is still active on the same cpu when
> > freeing the object) and a more costly locked double cmpxchg otherwise.
> > The downside is the lack of NUMA locality guarantees for the allocated
> > objects.
>
> Is that really cheaper than a local non locked double cmpxchg?

Don't know about this particular part but testing sheaves with maple
node cache and stress testing mmap/munmap syscalls shows performance
benefits as long as there is some delay to let kfree_rcu() do its job.
I'm still gathering results and will most likely post them tomorrow.

>
> Especially if you now have to use pushf/popf...
>
> > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > when full. After the grace period, the sheaf can be used for
> > allocations, which is more efficient than freeing and reallocating
> > individual slab objects (even with the batching done by kfree_rcu()
> > implementation itself). In case only some cpus are allowed to handle rcu
> > callbacks, the sheaf can still be made available to other cpus on the
> > same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > thus can benefit from this.
>
> Have you looked at fs/bcachefs/rcu_pending.c?
On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> <kent.overstreet@linux.dev> wrote:
> >
> > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > > after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > > freeing, which is normally a local double cmpxchg only for a short
> > > term allocations (so the same slab is still active on the same cpu when
> > > freeing the object) and a more costly locked double cmpxchg otherwise.
> > > The downside is the lack of NUMA locality guarantees for the allocated
> > > objects.
> >
> > Is that really cheaper than a local non locked double cmpxchg?
>
> Don't know about this particular part but testing sheaves with maple
> node cache and stress testing mmap/munmap syscalls shows performance
> benefits as long as there is some delay to let kfree_rcu() do its job.
> I'm still gathering results and will most likely post them tomorrow.

Here are the promised test results:

First I ran an Android app cycle test comparing the baseline against sheaves
used for maple tree nodes (as this patchset implements). I registered about
3% improvement in app launch times, indicating improvement in mmap syscall
performance.

Next I ran an mmap stress test which maps 5 1-page readable file-backed
areas, faults them in and finally unmaps them, timing mmap syscalls.
Repeats that 200000 cycles and reports the total time. Average of 10 such
runs is used as the final result.
3 configurations were tested:

1. Sheaves used for maple tree nodes only (this patchset).

2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
This patchset avoids allocating additional vm_lock structure on each mmap
syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.

3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.

The values represent the total time it took to perform mmap syscalls, less is
better.

(1)           baseline   control
Little core   7.58327    6.614939 (-12.77%)
Medium core   2.125315   1.428702 (-32.78%)
Big core      0.514673   0.422948 (-17.82%)

(2)           baseline   control
Little core   7.58327    5.141478 (-32.20%)
Medium core   2.125315   0.427692 (-79.88%)
Big core      0.514673   0.046642 (-90.94%)

(3)           baseline   control
Little core   7.58327    4.779624 (-36.97%)
Medium core   2.125315   0.450368 (-78.81%)
Big core      0.514673   0.037776 (-92.66%)

Results in (3) vs (2) indicate that using sheaves for vm_area_struct
yields slightly better averages and I noticed that this was mostly due
to sheaves results missing occasional spikes that worsened
TYPESAFE_BY_RCU averages (the results seemed more stable with sheaves).

[1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/

> >
> > Especially if you now have to use pushf/popf...
> >
> > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > > separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > > when full. After the grace period, the sheaf can be used for
> > > allocations, which is more efficient than freeing and reallocating
> > > individual slab objects (even with the batching done by kfree_rcu()
> > > implementation itself). In case only some cpus are allowed to handle rcu
> > > callbacks, the sheaf can still be made available to other cpus on the
> > > same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > > thus can benefit from this.
> >
> > Have you looked at fs/bcachefs/rcu_pending.c?
On 2/24/25 02:36, Suren Baghdasaryan wrote:
> On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> Don't know about this particular part but testing sheaves with maple
>> node cache and stress testing mmap/munmap syscalls shows performance
>> benefits as long as there is some delay to let kfree_rcu() do its job.
>> I'm still gathering results and will most likely post them tomorrow.

Without such delay, the perf is same or worse?

> Here are the promised test results:
>
> First I ran an Android app cycle test comparing the baseline against sheaves
> used for maple tree nodes (as this patchset implements). I registered about
> 3% improvement in app launch times, indicating improvement in mmap syscall
> performance.

There was no artificial 500us delay added for this test, right?

> Next I ran an mmap stress test which maps 5 1-page readable file-backed
> areas, faults them in and finally unmaps them, timing mmap syscalls.
> Repeats that 200000 cycles and reports the total time. Average of 10 such
> runs is used as the final result.
> 3 configurations were tested:
>
> 1. Sheaves used for maple tree nodes only (this patchset).
>
> 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> This patchset avoids allocating additional vm_lock structure on each mmap
> syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
>
> 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.

Hm why we can't use both? I don't think any kmem_cache_create check makes
them exclusive? TYPESAFE_BY_RCU only affects how slab pages are freed, it
doesn't e.g. delay reuse of individual objects, and caching in a sheaf
doesn't write to the object. Am I missing something?

> The values represent the total time it took to perform mmap syscalls, less is
> better.
>
> (1)           baseline   control
> Little core   7.58327    6.614939 (-12.77%)
> Medium core   2.125315   1.428702 (-32.78%)
> Big core      0.514673   0.422948 (-17.82%)
>
> (2)           baseline   control
> Little core   7.58327    5.141478 (-32.20%)
> Medium core   2.125315   0.427692 (-79.88%)
> Big core      0.514673   0.046642 (-90.94%)
>
> (3)           baseline   control
> Little core   7.58327    4.779624 (-36.97%)
> Medium core   2.125315   0.450368 (-78.81%)
> Big core      0.514673   0.037776 (-92.66%)
>
> Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> yields slightly better averages and I noticed that this was mostly due
> to sheaves results missing occasional spikes that worsened
> TYPESAFE_BY_RCU averages (the results seemed more stable with
> sheaves).

Thanks a lot, that looks promising!

> [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
>
On Mon, Feb 24, 2025 at 12:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/24/25 02:36, Suren Baghdasaryan wrote:
> > On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> Don't know about this particular part but testing sheaves with maple
> >> node cache and stress testing mmap/munmap syscalls shows performance
> >> benefits as long as there is some delay to let kfree_rcu() do its job.
> >> I'm still gathering results and will most likely post them tomorrow.
>
> Without such delay, the perf is same or worse?

The perf is about the same if there is no delay.

> > Here are the promised test results:
> >
> > First I ran an Android app cycle test comparing the baseline against sheaves
> > used for maple tree nodes (as this patchset implements). I registered about
> > 3% improvement in app launch times, indicating improvement in mmap syscall
> > performance.
>
> There was no artificial 500us delay added for this test, right?

Correct. No artificial changes in this test.

> > Next I ran an mmap stress test which maps 5 1-page readable file-backed
> > areas, faults them in and finally unmaps them, timing mmap syscalls.
> > Repeats that 200000 cycles and reports the total time. Average of 10 such
> > runs is used as the final result.
> > 3 configurations were tested:
> >
> > 1. Sheaves used for maple tree nodes only (this patchset).
> >
> > 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> > This patchset avoids allocating additional vm_lock structure on each mmap
> > syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
> >
> > 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> > to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> > TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.
>
> Hm why we can't use both? I don't think any kmem_cache_create check makes
> them exclusive? TYPESAFE_BY_RCU only affects how slab pages are freed, it
> doesn't e.g. delay reuse of individual objects, and caching in a sheaf
> doesn't write to the object. Am I missing something?

Ah, I was under impression that to use sheaves I would have to ensure
the freeing happens via kfree_rcu()->kfree_rcu_sheaf() path but now
that you mentioned that, I guess I could keep using kmem_cache_free()
and that would use free_to_pcs() internally... When time comes to free
the page, TYPESAFE_BY_RCU will free it after the grace period.
I can try that combination as well and see if anything breaks.

> > The values represent the total time it took to perform mmap syscalls, less is
> > better.
> >
> > (1)           baseline   control
> > Little core   7.58327    6.614939 (-12.77%)
> > Medium core   2.125315   1.428702 (-32.78%)
> > Big core      0.514673   0.422948 (-17.82%)
> >
> > (2)           baseline   control
> > Little core   7.58327    5.141478 (-32.20%)
> > Medium core   2.125315   0.427692 (-79.88%)
> > Big core      0.514673   0.046642 (-90.94%)
> >
> > (3)           baseline   control
> > Little core   7.58327    4.779624 (-36.97%)
> > Medium core   2.125315   0.450368 (-78.81%)
> > Big core      0.514673   0.037776 (-92.66%)
> >
> > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> > yields slightly better averages and I noticed that this was mostly due
> > to sheaves results missing occasional spikes that worsened
> > TYPESAFE_BY_RCU averages (the results seemed more stable with
> > sheaves).
>
> Thanks a lot, that looks promising!

Indeed, that looks better than I expected :)
Cheers!

> > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
> >
On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Feb 24, 2025 at 12:53 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 2/24/25 02:36, Suren Baghdasaryan wrote:
> > > On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >>
> > >> Don't know about this particular part but testing sheaves with maple
> > >> node cache and stress testing mmap/munmap syscalls shows performance
> > >> benefits as long as there is some delay to let kfree_rcu() do its job.
> > >> I'm still gathering results and will most likely post them tomorrow.
> >
> > Without such delay, the perf is same or worse?
>
> The perf is about the same if there is no delay.
>
> > > Here are the promised test results:
> > >
> > > First I ran an Android app cycle test comparing the baseline against sheaves
> > > used for maple tree nodes (as this patchset implements). I registered about
> > > 3% improvement in app launch times, indicating improvement in mmap syscall
> > > performance.
> >
> > There was no artificial 500us delay added for this test, right?
>
> Correct. No artificial changes in this test.
>
> > > Next I ran an mmap stress test which maps 5 1-page readable file-backed
> > > areas, faults them in and finally unmaps them, timing mmap syscalls.
> > > Repeats that 200000 cycles and reports the total time. Average of 10 such
> > > runs is used as the final result.
> > > 3 configurations were tested:
> > >
> > > 1. Sheaves used for maple tree nodes only (this patchset).
> > >
> > > 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> > > This patchset avoids allocating additional vm_lock structure on each mmap
> > > syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
> > >
> > > 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> > > to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> > > TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.
> >
> > Hm why we can't use both? I don't think any kmem_cache_create check makes
> > them exclusive? TYPESAFE_BY_RCU only affects how slab pages are freed, it
> > doesn't e.g. delay reuse of individual objects, and caching in a sheaf
> > doesn't write to the object. Am I missing something?
>
> Ah, I was under impression that to use sheaves I would have to ensure
> the freeing happens via kfree_rcu()->kfree_rcu_sheaf() path but now
> that you mentioned that, I guess I could keep using kmem_cache_free()
> and that would use free_to_pcs() internally... When time comes to free
> the page, TYPESAFE_BY_RCU will free it after the grace period.
> I can try that combination as well and see if anything breaks.

This seems to be working fine. The new configuration is:

4. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
to vm_refcnt conversion [1]. vm_area_struct cache uses both TYPESAFE_BY_RCU
and sheaves (but obviously not kfree_rcu_sheaf()).

> > > The values represent the total time it took to perform mmap syscalls, less is
> > > better.
> > >
> > > (1)           baseline   control
> > > Little core   7.58327    6.614939 (-12.77%)
> > > Medium core   2.125315   1.428702 (-32.78%)
> > > Big core      0.514673   0.422948 (-17.82%)
> > >
> > > (2)           baseline   control
> > > Little core   7.58327    5.141478 (-32.20%)
> > > Medium core   2.125315   0.427692 (-79.88%)
> > > Big core      0.514673   0.046642 (-90.94%)
> > >
> > > (3)           baseline   control
> > > Little core   7.58327    4.779624 (-36.97%)
> > > Medium core   2.125315   0.450368 (-78.81%)
> > > Big core      0.514673   0.037776 (-92.66%)

(4)           baseline   control
Little core   7.58327    4.642977 (-38.77%)
Medium core   2.125315   0.373692 (-82.42%)
Big core      0.514673   0.043613 (-91.53%)

I think the difference between (3) and (4) is noise.
Thanks,
Suren.

> > >
> > > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> > > yields slightly better averages and I noticed that this was mostly due
> > > to sheaves results missing occasional spikes that worsened
> > > TYPESAFE_BY_RCU averages (the results seemed more stable with
> > > sheaves).
> >
> > Thanks a lot, that looks promising!
>
> Indeed, that looks better than I expected :)
> Cheers!
>
> >
> > > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
> > >
>
On 2/25/25 21:26, Suren Baghdasaryan wrote:
> On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> >
>> > > The values represent the total time it took to perform mmap syscalls, less is
>> > > better.
>> > >
>> > > (1)           baseline   control
>> > > Little core   7.58327    6.614939 (-12.77%)
>> > > Medium core   2.125315   1.428702 (-32.78%)
>> > > Big core      0.514673   0.422948 (-17.82%)
>> > >
>> > > (2)           baseline   control
>> > > Little core   7.58327    5.141478 (-32.20%)
>> > > Medium core   2.125315   0.427692 (-79.88%)
>> > > Big core      0.514673   0.046642 (-90.94%)
>> > >
>> > > (3)           baseline   control
>> > > Little core   7.58327    4.779624 (-36.97%)
>> > > Medium core   2.125315   0.450368 (-78.81%)
>> > > Big core      0.514673   0.037776 (-92.66%)
>
> (4)           baseline   control
> Little core   7.58327    4.642977 (-38.77%)
> Medium core   2.125315   0.373692 (-82.42%)
> Big core      0.514673   0.043613 (-91.53%)
>
> I think the difference between (3) and (4) is noise.
> Thanks,
> Suren.

Hi, as we discussed yesterday, it would be useful to set the baseline to
include everything before sheaves as that's already on the way to 6.15, so
we can see more clearly what sheaves do relative to that. So at this point
it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
thus like in scenario (4)), and benchmark the following:

- baseline - vma locking conversion with TYPESAFE_BY_RCU
- baseline+maple tree node reduction from mm-unstable (Liam might point out
  which patches?)
- the above + this series + sheaves enabled for vm_area_struct cache
- the above + full maple node sheaves conversion [1]
- the above + the top-most patches from [1] that are optimizations with a
  tradeoff (not clear win-win) so it would be good to know if they are useful

[1] currently the 4 commits here:
https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
but as Liam noted, they won't cherry pick without conflict once maple tree
node reduction is backported, but he's working on a rebase

Thanks in advance!

>> > >
>> > > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
>> > > yields slightly better averages and I noticed that this was mostly due
>> > > to sheaves results missing occasional spikes that worsened
>> > > TYPESAFE_BY_RCU averages (the results seemed more stable with
>> > > sheaves).
>> >
>> > Thanks a lot, that looks promising!
>>
>> Indeed, that looks better than I expected :)
>> Cheers!
>>
>> >
>> > > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
>> > >
>> >
* Vlastimil Babka <vbabka@suse.cz> [250304 05:55]:
> On 2/25/25 21:26, Suren Baghdasaryan wrote:
> > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> >
> >> > > The values represent the total time it took to perform mmap syscalls, less is
> >> > > better.
> >> > >
> >> > > (1)           baseline   control
> >> > > Little core   7.58327    6.614939 (-12.77%)
> >> > > Medium core   2.125315   1.428702 (-32.78%)
> >> > > Big core      0.514673   0.422948 (-17.82%)
> >> > >
> >> > > (2)           baseline   control
> >> > > Little core   7.58327    5.141478 (-32.20%)
> >> > > Medium core   2.125315   0.427692 (-79.88%)
> >> > > Big core      0.514673   0.046642 (-90.94%)
> >> > >
> >> > > (3)           baseline   control
> >> > > Little core   7.58327    4.779624 (-36.97%)
> >> > > Medium core   2.125315   0.450368 (-78.81%)
> >> > > Big core      0.514673   0.037776 (-92.66%)
> >
> > (4)           baseline   control
> > Little core   7.58327    4.642977 (-38.77%)
> > Medium core   2.125315   0.373692 (-82.42%)
> > Big core      0.514673   0.043613 (-91.53%)
> >
> > I think the difference between (3) and (4) is noise.
> > Thanks,
> > Suren.
>
> Hi, as we discussed yesterday, it would be useful to set the baseline to
> include everything before sheaves as that's already on the way to 6.15, so
> we can see more clearly what sheaves do relative to that. So at this point
> it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
> thus like in scenario (4)), and benchmark the following:
>
> - baseline - vma locking conversion with TYPESAFE_BY_RCU
> - baseline+maple tree node reduction from mm-unstable (Liam might point out
> which patches?)

Sid's patches [1] are already in mm-unstable.

> - the above + this series + sheaves enabled for vm_area_struct cache
> - the above + full maple node sheaves conversion [1]
> - the above + the top-most patches from [1] that are optimizations with a
> tradeoff (not clear win-win) so it would be good to know if they are useful
>
> [1] currently the 4 commits here:
> https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
> but as Liam noted, they won't cherry pick without conflict once maple tree
> node reduction is backported, but he's working on a rebase

Rebased maple tree sheaves, patches are here [2].

> ...

Thanks,
Liam

[1]. https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
[2]. https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
On Tue, Mar 4, 2025 at 11:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Vlastimil Babka <vbabka@suse.cz> [250304 05:55]:
> > On 2/25/25 21:26, Suren Baghdasaryan wrote:
> > > On Mon, Feb 24, 2025 at 1:12 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >>
> > >> >
> > >> > > The values represent the total time it took to perform mmap syscalls, less is
> > >> > > better.
> > >> > >
> > >> > > (1) baseline control
> > >> > > Little core 7.58327 6.614939 (-12.77%)
> > >> > > Medium core 2.125315 1.428702 (-32.78%)
> > >> > > Big core 0.514673 0.422948 (-17.82%)
> > >> > >
> > >> > > (2) baseline control
> > >> > > Little core 7.58327 5.141478 (-32.20%)
> > >> > > Medium core 2.125315 0.427692 (-79.88%)
> > >> > > Big core 0.514673 0.046642 (-90.94%)
> > >> > >
> > >> > > (3) baseline control
> > >> > > Little core 7.58327 4.779624 (-36.97%)
> > >> > > Medium core 2.125315 0.450368 (-78.81%)
> > >> > > Big core 0.514673 0.037776 (-92.66%)
> > >
> > > (4) baseline control
> > > Little core 7.58327 4.642977 (-38.77%)
> > > Medium core 2.125315 0.373692 (-82.42%)
> > > Big core 0.514673 0.043613 (-91.53%)
> > >
> > > I think the difference between (3) and (4) is noise.
> > > Thanks,
> > > Suren.
> >
> > Hi, as we discussed yesterday, it would be useful to set the baseline to
> > include everything before sheaves as that's already on the way to 6.15, so
> > we can see more clearly what sheaves do relative to that. So at this point
> > it's the vma lock conversion including TYPESAFE_BY_RCU (that's not undone,
> > thus like in scenario (4)), and benchmark the following:
> >
> > - baseline - vma locking conversion with TYPESAFE_BY_RCU
> > - baseline+maple tree node reduction from mm-unstable (Liam might point out
> > which patches?)
>
> Sid's patches [1] are already in mm-unstable.
>
>
> > - the above + this series + sheaves enabled for vm_area_struct cache
> > - the above + full maple node sheaves conversion [1]
> > - the above + the top-most patches from [1] that are optimizations with a
> > tradeoff (not clear win-win) so it would be good to know if they are useful
> >
> > [1] currently the 4 commits here:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
> > from "maple_tree: Sheaf conversion" to "maple_tree: Clean up sheaf"
> > but as Liam noted, they won't cherry pick without conflict once maple tree
> > node reduction is backported, but he's working on a rebase
>
> Rebased maple tree sheaves, patches are here [2].
Hi Folks,
Sorry for the delay. I got the numbers last week but they looked a bit
weird, so I reran the test increasing the number of iterations to make
sure noise is not a factor. That took most of this week. Below are the
results. Please note that I had to backport the patchsets to 6.12
because that's the closest stable Android kernel I can use. I measure
cumulative time to execute mmap syscalls, so the smaller the number
the better mmap performance is:
baseline: 6.12 + vm_lock conversion and TYPESAFE_BY_RCU
config1: baseline + Sid's patches [1]
config2: sheaves RFC
config3: config1 + vm_area_struct with sheaves
config4: config2 + maple_tree Sheaf conversion [2]
config5: config3 + 2 last optimization patches from [3]
             config1   config2   config3   config4   config5
Little core   -0.10%   -10.10%   -12.89%   -10.02%   -13.64%
Mid core     -21.05%   -37.31%   -44.97%   -15.81%   -22.15%
Big core     -17.17%   -34.41%   -45.68%   -11.39%   -15.29%
[1] https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
[2] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
[3] https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
From the numbers, it looks like config4 regresses the performance and
that's what looked weird to me last week and I wanted to confirm this.
But from sheaves POV, it looks like they provide the benefits I saw
before. Sid's patches which I did not test separately before also look
beneficial.
Thanks,
Suren.
>
>
> >
> >
> ...
>
> Thanks,
> Liam
>
> [1]. https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
> [2]. https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
On 3/14/25 18:10, Suren Baghdasaryan wrote:
> On Tue, Mar 4, 2025 at 11:08 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>> ...
>
> Hi Folks,
> Sorry for the delay. I got the numbers last week but they looked a bit
> weird, so I reran the test increasing the number of iterations to make
> sure noise is not a factor. That took most of this week. Below are the
> results. Please note that I had to backport the patchsets to 6.12
> because that's the closest stable Android kernel I can use. I measure
> cumulative time to execute mmap syscalls, so the smaller the number
> the better mmap performance is:

Is that a particular benchmark doing those syscalls, or you time them within
actual workloads?

> baseline: 6.12 + vm_lock conversion and TYPESAFE_BY_RCU
> config1: baseline + Sid's patches [1]
> config2: sheaves RFC
> config3: config1 + vm_area_struct with sheaves
> config4: config2 + maple_tree Sheaf conversion [2]
> config5: config3 + 2 last optimization patches from [3]
>
>              config1   config2   config3   config4   config5
> Little core   -0.10%   -10.10%   -12.89%   -10.02%   -13.64%
> Mid core     -21.05%   -37.31%   -44.97%   -15.81%   -22.15%
> Big core     -17.17%   -34.41%   -45.68%   -11.39%   -15.29%

Thanks a lot, Suren.

> [1] https://lore.kernel.org/linux-mm/20250227204823.758784-1-sidhartha.kumar@oracle.com/
> [2] https://www.infradead.org/git/?p=users/jedix/linux-maple.git;a=shortlog;h=refs/heads/sheaves_rebase_20250304
> [3] https://web.git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-percpu-sheaves-v2-maple
>
> From the numbers, it looks like config4 regresses the performance and
> that's what looked weird to me last week and I wanted to confirm this.
> But from sheaves POV, it looks like they provide the benefits I saw
> before. Sid's patches which I did not test separately before also look
> beneficial.

Indeed, good job, Sid. It's weird that config4 isn't doing well. The problem
can be either in sheaves side (the sheaves preallocation isn't effective) or
maple tree side doing some excessive work. It could be caused by the wrong
condition in kmem_cache_return_sheaf() that Harry pointed out, so v3 might
improve if that was it. Otherwise we'll probably need to fill the gaps in
sheaf-related stats and see what are the differences between config3 and
config4.

> Thanks,
> Suren.
On Mon, Mar 17, 2025 at 4:08 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/14/25 18:10, Suren Baghdasaryan wrote:
> > ...
> > I measure
> > cumulative time to execute mmap syscalls, so the smaller the number
> > the better mmap performance is:
>
> Is that a particular benchmark doing those syscalls, or you time them within
> actual workloads?

I time them inside my workload.

> ...
On Tue, Mar 4, 2025 at 2:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 2/25/25 21:26, Suren Baghdasaryan wrote:
> > ...
>
> Hi, as we discussed yesterday, it would be useful to set the baseline to
> include everything before sheaves as that's already on the way to 6.15, so
> we can see more clearly what sheaves do relative to that. ...
>
> Thanks in advance!

Sure, I'll run the tests and post results sometime later this week. Thanks!

> > > > Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> > > > yields slightly better averages and I noticed that this was mostly due
> > > > to sheaves results missing occasional spikes that worsened
> > > > TYPESAFE_BY_RCU averages (the results seemed more stable with
> > > > sheaves).
> > >
> > > Thanks a lot, that looks promising!
> >
> > Indeed, that looks better than I expected :)
> > Cheers!
> >
> > [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
On Sun, Feb 23, 2025 at 5:36 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Sat, Feb 22, 2025 at 8:44 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Sat, Feb 22, 2025 at 4:19 PM Kent Overstreet
> > <kent.overstreet@linux.dev> wrote:
> > >
> > > On Fri, Feb 14, 2025 at 05:27:36PM +0100, Vlastimil Babka wrote:
> > > > - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> > > > after Patch 5 it's preempt_disable() and no atomic operations. Same for
> > > > freeing, which is normally a local double cmpxchg only for a short
> > > > term allocations (so the same slab is still active on the same cpu when
> > > > freeing the object) and a more costly locked double cmpxchg otherwise.
> > > > The downside is the lack of NUMA locality guarantees for the allocated
> > > > objects.
> > >
> > > Is that really cheaper than a local non locked double cmpxchg?
> >
> > Don't know about this particular part but testing sheaves with maple
> > node cache and stress testing mmap/munmap syscalls shows performance
> > benefits as long as there is some delay to let kfree_rcu() do its job.
> > I'm still gathering results and will most likely post them tomorrow.
>
> Here are the promised test results:
>
> First I ran an Android app cycle test comparing the baseline against sheaves
> used for maple tree nodes (as this patchset implements). I registered about
> 3% improvement in app launch times, indicating improvement in mmap syscall
> performance.
> Next I ran an mmap stress test which maps 5 1-page readable file-backed
> areas, faults them in and finally unmaps them, timing mmap syscalls.

I forgot to mention that I also added a 500us delay after each cycle
described above to give kfree_rcu() a chance to run.

> Repeats that 200000 cycles and reports the total time. Average of 10 such
> runs is used as the final result.
> 3 configurations were tested:
>
> 1. Sheaves used for maple tree nodes only (this patchset).
>
> 2. Sheaves used for maple tree nodes with vm_lock to vm_refcnt conversion [1].
> This patchset avoids allocating additional vm_lock structure on each mmap
> syscall and uses TYPESAFE_BY_RCU for vm_area_struct cache.
>
> 3. Sheaves used for maple tree nodes and for vm_area_struct cache with vm_lock
> to vm_refcnt conversion [1]. For the vm_area_struct cache I had to replace
> TYPESAFE_BY_RCU with sheaves, as we can't use both for the same cache.
>
> The values represent the total time it took to perform mmap syscalls, less is
> better.
>
> ...
>
> Results in (3) vs (2) indicate that using sheaves for vm_area_struct
> yields slightly better averages and I noticed that this was mostly due
> to sheaves results missing occasional spikes that worsened
> TYPESAFE_BY_RCU averages (the results seemed more stable with
> sheaves).
>
> [1] https://lore.kernel.org/all/20250213224655.1680278-1-surenb@google.com/
>
> > > Especially if you now have to use pushf/popf...
> > >
> > > > - kfree_rcu() batching and recycling. kfree_rcu() will put objects to a
> > > > separate percpu sheaf and only submit the whole sheaf to call_rcu()
> > > > when full. After the grace period, the sheaf can be used for
> > > > allocations, which is more efficient than freeing and reallocating
> > > > individual slab objects (even with the batching done by kfree_rcu()
> > > > implementation itself). In case only some cpus are allowed to handle rcu
> > > > callbacks, the sheaf can still be made available to other cpus on the
> > > > same node via the shared barn. The maple_node cache uses kfree_rcu() and
> > > > thus can benefit from this.
> > >
> > > Have you looked at fs/bcachefs/rcu_pending.c?
On Fri, 14 Feb 2025, Vlastimil Babka wrote:

> - Cheaper fast paths. For allocations, instead of local double cmpxchg,
> after Patch 5 it's preempt_disable() and no atomic operations. Same for
> freeing, which is normally a local double cmpxchg only for a short
> term allocations (so the same slab is still active on the same cpu when
> freeing the object) and a more costly locked double cmpxchg otherwise.
> The downside is the lack of NUMA locality guarantees for the allocated
> objects.

The local double cmpxchg is not an atomic instruction. For that it would
need a lock prefix. The local cmpxchg is atomic vs an interrupt because
the interrupt can only occur between instructions. That is true for any
processor instruction. We use the fact that the cmpxchg does a RMW in one
unbreakable instruction to ensure that interrupts cannot do evil things
to the fast path.