[PATCH RFC 00/19] slab: replace cpu (partial) slabs with sheaves
Posted by Vlastimil Babka 3 months, 2 weeks ago
Percpu sheaves caching was introduced as opt-in but the goal was to
eventually move all caches to them. This is the next step, enabling
sheaves for all caches (except the two bootstrap ones) and then removing
the per cpu (partial) slabs and lots of associated code.
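
As a reminder of the scheme, a simplified sketch (illustrative names only,
not the exact mm/slub.c layout or locking): each cpu caches object pointers
in small arrays ("sheaves"), the fast paths are a LIFO push/pop on the cpu's
main sheaf, and full/empty sheaves are exchanged with a per-node "barn":

#include <stdbool.h>
#include <stddef.h>

struct slab_sheaf {
	unsigned int size;		/* objects currently cached */
	void *objects[];		/* capacity is chosen per cache */
};

struct percpu_sheaves {
	struct slab_sheaf *main;	/* alloc/free served from here */
	struct slab_sheaf *spare;	/* swapped in when main runs empty/full */
};

/* allocation fast path: pop the most recently freed object */
static inline void *sheaf_alloc(struct slab_sheaf *sheaf)
{
	return sheaf->size ? sheaf->objects[--sheaf->size] : NULL;
}

/* free fast path: push the object; when the sheaf is full the caller
 * exchanges it with an empty one from the barn */
static inline bool sheaf_free(struct slab_sheaf *sheaf, void *object,
			      unsigned int capacity)
{
	if (sheaf->size >= capacity)
		return false;
	sheaf->objects[sheaf->size++] = object;
	return true;
}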

Besides (hopefully) improved performance, this removes the rather
complicated code related to the lockless fastpaths (using
this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
kmalloc_nolock().

The lockless slab freelist+counters update operation using
try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
without repeating the "alien" array flushing of SLUB, and to allow
flushing objects from sheaves to slabs mostly without the node
list_lock.
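
For context, the operation meant here is the double-word cmpxchg that
updates the slab's freelist head together with its packed counters, so a
(remote) free can be published without taking the node list_lock. A minimal
userspace sketch of the idea (names, layout and the plain in-use counter
are illustrative, not the actual mm/slub.c code):

#include <stdatomic.h>
#include <stdint.h>

/* freelist head and counters share one double word so both are updated
 * by a single compare-and-exchange */
struct freelist_counters {
	void *freelist;		/* first free object in the slab */
	uintptr_t counters;	/* simplified: just the in-use count here */
};

struct slab_stub {
	_Atomic struct freelist_counters fc;
};

/* lockless free of one object back to its slab */
static void slab_free_object(struct slab_stub *slab, void *object)
{
	struct freelist_counters old = atomic_load(&slab->fc);
	struct freelist_counters new;

	do {
		/* the free pointer lives inside the object itself */
		*(void **)object = old.freelist;
		new.freelist = object;
		new.counters = old.counters - 1;
	} while (!atomic_compare_exchange_weak(&slab->fc, &old, new));
}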

This is the first RFC to get feedback. Biggest TODOs are:

- cleanup of stat counters to fit the new scheme
- integration of rcu sheaves handling with kfree_rcu batching
- performance evaluation

Git branch: https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=b4/sheaves-for-all

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
Vlastimil Babka (19):
      slab: move kfence_alloc() out of internal bulk alloc
      slab: handle pfmemalloc slabs properly with sheaves
      slub: remove CONFIG_SLUB_TINY specific code paths
      slab: prevent recursive kmalloc() in alloc_empty_sheaf()
      slab: add sheaves to most caches
      slab: introduce percpu sheaves bootstrap
      slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()
      slab: handle kmalloc sheaves bootstrap
      slab: add optimized sheaf refill from partial list
      slab: remove cpu (partial) slabs usage from allocation paths
      slab: remove SLUB_CPU_PARTIAL
      slab: remove the do_slab_free() fastpath
      slab: remove defer_deactivate_slab()
      slab: simplify kmalloc_nolock()
      slab: remove struct kmem_cache_cpu
      slab: remove unused PREEMPT_RT specific macros
      slab: refill sheaves from all nodes
      slab: update overview comments
      slab: remove frozen slab checks from __slab_free()

 include/linux/gfp_types.h |    6 -
 include/linux/slab.h      |    6 -
 mm/Kconfig                |   11 -
 mm/internal.h             |    1 +
 mm/page_alloc.c           |    5 +
 mm/slab.h                 |   47 +-
 mm/slub.c                 | 2601 ++++++++++++++++-----------------------------
 7 files changed, 915 insertions(+), 1762 deletions(-)
---
base-commit: 7b34bb10d15c412cdce0a1ea3b5701888b885673
change-id: 20251002-sheaves-for-all-86ac13dc47a5

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>
Re: [PATCH RFC 00/19] slab: replace cpu (partial) slabs with sheaves
Posted by Alexei Starovoitov 3 months, 2 weeks ago
On Thu, Oct 23, 2025 at 6:53 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> Percpu sheaves caching was introduced as opt-in but the goal was to
> eventually move all caches to them. This is the next step, enabling
> sheaves for all caches (except the two bootstrap ones) and then removing
> the per cpu (partial) slabs and lots of associated code.
>
> Besides (hopefully) improved performance, this removes the rather
> complicated code related to the lockless fastpaths (using
> this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
> kmalloc_nolock().
>
> The lockless slab freelist+counters update operation using
> try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
> without repeating the "alien" array flushing of SLUB, and to allow
> flushing objects from sheaves to slabs mostly without the node
> list_lock.
>
> This is the first RFC to get feedback. Biggest TODOs are:
>
> - cleanup of stat counters to fit the new scheme
> - integration of rcu sheaves handling with kfree_rcu batching

The whole thing looks good, and imo these two are lower priority.

> - performance evaluation

The performance results will be the key.
What kind of benchmarks do you have in mind?
Re: [PATCH RFC 00/19] slab: replace cpu (partial) slabs with sheaves
Posted by Christoph Lameter (Ampere) 3 months ago
On Thu, 23 Oct 2025, Vlastimil Babka wrote:

> Besides (hopefully) improved performance, this removes the rather
> complicated code related to the lockless fastpaths (using
> this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
> kmalloc_nolock().

Going back to a strict LIFO scheme for alloc/free removes the following
performance features:

1. Objects are served randomly from a variety of slab pages instead of
serving all available objects from a single slab page before moving to the
next. This means the objects require a larger set of TLB entries to cover
them, so TLB pressure will increase.

2. The number of partial slabs will increase, since the free objects in a
partial page are not used up before moving on to the next. Instead, free
objects from random slab pages are used.

Spatial object locality is reduced. Temporal object hotness increases.

> The lockless slab freelist+counters update operation using
> try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
> without repeating the "alien" array flushing of SLUB, and to allow
> flushing objects from sheaves to slabs mostly without the node
> list_lock.

Hmm... So potential cache hot objects are lost that way and reused on
another node next. The role of the alien caches in SLAB was to cover that
case and we saw performance regressions without these caches.

The method of freeing still reduces the number of remote partial slabs
that have to be managed and increases the locality of the objects.
Re: [PATCH RFC 00/19] slab: replace cpu (partial) slabs with sheaves
Posted by Vlastimil Babka 3 weeks, 6 days ago
On 11/4/25 23:11, Christoph Lameter (Ampere) wrote:
> On Thu, 23 Oct 2025, Vlastimil Babka wrote:
> 
>> Besides (hopefully) improved performance, this removes the rather
>> complicated code related to the lockless fastpaths (using
>> this_cpu_try_cmpxchg128/64) and its complications with PREEMPT_RT or
>> kmalloc_nolock().

Sorry for the late reply, and thanks for the insights; I will incorporate
them into the cover letter.

> Going back to a strict LIFO scheme for alloc/free removes the following
> performance features:
> 
> 1. Objects are served randomly from a variety of slab pages instead of
> serving all available objects from a single slab page before moving to the
> next. This means the objects require a larger set of TLB entries to cover
> them, so TLB pressure will increase.

OK. Should hopefully be mitigated by the huge direct mappings. Also, IIRC,
when Mike was evaluating patches to better preserve the huge mappings
against splitting, the benefits were so low that the effort was abandoned,
which suggests the TLB pressure on the direct map isn't that bad.

> 2. The number of partial slabs will increase, since the free objects in a
> partial page are not used up before moving on to the next. Instead, free
> objects from random slab pages are used.

Agreed. Should be bounded by the number of cpu+barn sheaves.
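
For a rough feel with made-up numbers (not from this series): a sheaf
capacity of 32, two percpu sheaves (main + spare) on each of 64 cpus and,
say, 8 full sheaves allowed in the node's barn gives at most
(2 * 64 + 8) * 32 = 4352 objects held in sheaves per cache, so in the worst
case about that many extra slabs can stay partial.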

> Spatial object locality is reduced. Temporal object hotness increases.

Ack.

>> The lockless slab freelist+counters update operation using
>> try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects
>> without repeating the "alien" array flushing of SLUB, and to allow
>> flushing objects from sheaves to slabs mostly without the node
>> list_lock.
> 
> Hmm... So potential cache hot objects are lost that way and reused on
> another node next. The role of the alien caches in SLAB was to cover that
> case and we saw performance regressions without these caches.

Interesting observation. I think commit e00946fe2351 ("[PATCH] slab: Bypass
free lists for __drain_alien_cache()") is relevant?

But I wonder: wouldn't the objects tend to be cache hot on the cpu that was
freeing them (and to which they were remote), and then, after the
alien->shared array transfer, be reallocated on a different cpu (to which
they are local)? So I wouldn't expect cache hotness benefits there?

> The method of freeing still reduces the number of remote partial slabs
> that have to be managed and increases the locality of the objects.

Ack.