mm/slab: enable runtime sheaves tuning

[PATCH RFC 0/8] mm/slab: enable runtime sheaves tuning

Posted by Harry Yoo (Oracle) 1 week, 2 days ago

Background
==========

Sheaves were introduced in v6.18, and starting from v7.0, they are
enabled for all slab caches (except for kmem_cache{,_node}). In the
pre-sheaves era, there was a cpu_partial parameter to tune the number
of objects cached per CPU. However, sheaves don't have an equivalent
and the sheaf capacity is determined in the kernel code.

The goal is to allow tuning sheaves at runtime by the next LTS.

Overview
========

This patchset does two main things:

  1. Make the sheaf_capacity sysfs attribute writable so that the number
     of objects cached per CPU can be changed at runtime, and

  2. Expose MAX_FULL_SHEAVES and MAX_EMPTY_SHEAVES as sysfs attributes
     rather than constants, so that users can tune them.

Measuring the performance impact of these tunables is TBD.

Roughly, the sequence to change sheaf_capacity is as follows:

  1. Disable sheaves. Make all online CPUs replace their main sheaves
     with the bootstrap sheaf under local_lock and wait for completion.

  2. Wait for all in-flight RCU callbacks to be processed.
  
  3. Flush and free all existing sheaves.

  4. Re-enable sheaves with a new capacity.
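
The four steps above can be pictured with a small single-threaded userspace model. This is only a sketch under stated assumptions, not the code from the series: NR_CPUS, struct pcs, struct cache and every function name here are hypothetical, objects are opaque tokens, and local_lock and the RCU wait are reduced to comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS 4	/* hypothetical fixed CPU count for the model */

struct sheaf {
	unsigned short capacity;
	unsigned short size;
	void *objects[];
};

/* Zero-capacity placeholder: every fastpath attempt on it falls through. */
static struct sheaf bootstrap_sheaf;

struct pcs {
	struct sheaf *main;
};

struct cache {
	unsigned short sheaf_capacity;
	struct pcs cpu[NR_CPUS];
};

static struct sheaf *sheaf_alloc(unsigned short capacity)
{
	struct sheaf *sheaf = calloc(1, sizeof(*sheaf) + capacity * sizeof(void *));

	if (sheaf)
		sheaf->capacity = capacity;
	return sheaf;
}

/* Step 4 (also used for initial setup): install fresh main sheaves. */
static int cache_enable_sheaves(struct cache *c, unsigned short capacity)
{
	struct sheaf *fresh[NR_CPUS];
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		fresh[cpu] = sheaf_alloc(capacity);
		if (!fresh[cpu]) {
			while (cpu--)
				free(fresh[cpu]);
			return -1;	/* fail without installing anything */
		}
	}
	c->sheaf_capacity = capacity;
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		c->cpu[cpu].main = fresh[cpu];
	return 0;
}

/* Fastpath free: returns 1 if cached in the main sheaf, 0 for the slowpath. */
static int cache_free_fast(struct cache *c, int cpu, void *obj)
{
	struct sheaf *sheaf;

	assert(cpu >= 0 && cpu < NR_CPUS);
	sheaf = c->cpu[cpu].main;
	if (sheaf->size >= sheaf->capacity)
		return 0;
	sheaf->objects[sheaf->size++] = obj;
	return 1;
}

/* Returns the number of objects flushed, or -1 if re-enabling failed. */
static int cache_set_sheaf_capacity(struct cache *c, unsigned short capacity)
{
	struct sheaf *old[NR_CPUS];
	int cpu, flushed = 0;

	/* 1. Disable: swap every main sheaf for the bootstrap sheaf. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		old[cpu] = c->cpu[cpu].main;
		c->cpu[cpu].main = &bootstrap_sheaf;
	}
	/* 2. The kernel would wait for in-flight RCU callbacks here. */
	/* 3. Flush and free the old sheaves (the model only counts objects). */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!old[cpu] || old[cpu] == &bootstrap_sheaf)
			continue;
		flushed += old[cpu]->size;
		free(old[cpu]);
	}
	/* 4. Re-enable with the new capacity. */
	if (cache_enable_sheaves(c, capacity))
		return -1;
	return flushed;
}
```

Between steps 1 and 4 every CPU points at the zero-capacity bootstrap sheaf, so cache_free_fast() returns 0 and callers take the slowpath. If step 4 fails, the model simply leaves sheaves disabled.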

Challenges
==========

1. Allocations and frees can happen concurrently at any point between
   these steps, and we cannot introduce heavyweight synchronization
   mechanisms on the fastpath.

2. Currently, cache_has_sheaves() checks whether a cache has sheaves.
   This works now because sheaves cannot be enabled or disabled once
   the cache is created.

   The question "Does this cache have sheaves?" should be split into
   "Does this cache support sheaves?" and "Does this CPU actually have
   sheaves enabled right now?".

3. Once the sheaf capacity update is complete, no sheaf with a stale
   capacity may remain. Flushing and freeing all existing sheaves is
   relatively simple, but under the current design it is quite
   challenging to prevent sheaves with a stale capacity from being
   installed in the pcs or the barn. Reading s->sheaf_capacity without
   an expensive synchronization primitive is racy.

   Patch 6 addresses this by adding a copy of s->sheaf_capacity to
   struct slub_percpu_sheaves. pcs->capacity is copied from
   s->sheaf_capacity and is stable under local_lock. If
   s->sheaf_capacity and pcs->capacity don't match, the sheaf_capacity
   writer is responsible for flushing and freeing the stale sheaves
   before completing the process.
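
To illustrate the idea behind pcs->capacity, here is a minimal single-threaded sketch, assuming a simplified pcs with a single spare slot. The function names (pcs_install_spare, cache_sync_capacity), NR_CPUS and the struct layouts are hypothetical and are not taken from patch 6. The point is that the install check compares against the per-CPU copy, and the writer later sweeps every CPU whose copy no longer matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4	/* hypothetical fixed CPU count for the model */

struct sheaf {
	unsigned short capacity;
	unsigned short size;
};

struct pcs {
	unsigned short capacity;	/* per-CPU copy of the cache-wide value */
	struct sheaf *spare;
};

struct cache {
	unsigned short sheaf_capacity;	/* changed by the sysfs writer */
	struct pcs cpu[NR_CPUS];
};

/*
 * Install a sheaf only if it matches this CPU's copy of the capacity.
 * In the proposal this check would run under local_lock; the model has
 * no concurrency, so the lock is omitted.
 */
static bool pcs_install_spare(struct pcs *pcs, struct sheaf *sheaf)
{
	assert(sheaf != NULL);
	if (sheaf->capacity != pcs->capacity)
		return false;
	pcs->spare = sheaf;
	return true;
}

/*
 * Writer side: for every CPU whose copy is out of date, drop the stale
 * sheaf (a real implementation would flush and free it) and refresh the
 * copy. Returns the number of CPUs that were updated.
 */
static int cache_sync_capacity(struct cache *c)
{
	int cpu, updated = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		struct pcs *pcs = &c->cpu[cpu];

		if (pcs->capacity == c->sheaf_capacity)
			continue;
		pcs->spare = NULL;
		pcs->capacity = c->sheaf_capacity;
		updated++;
	}
	return updated;
}
```

In this model a sheaf sized for the new capacity is rejected until the writer has refreshed that CPU's copy, and sheaves with the old capacity are rejected afterwards.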

Patch Sequence
==============

Patches 1-3: A per-sheaf capacity is required for the following steps,
but I didn't want to grow struct slab_sheaf. So patch 1 drops the cache
pointer (which was used only on the slowpath), patch 2 changes
sheaf_capacity from unsigned int to unsigned short, and patch 3 adds
per-sheaf capacity.

In fact, the struct ends up smaller after those patches.

After (24 bytes, excluding the objects flex array):

struct slab_sheaf {
	union {
		struct rcu_head rcu_head;
		struct list_head barn_list;
		bool pfmemalloc;
	};
	unsigned short capacity;
	unsigned short size;
	int node;
	void *objects[];
};
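
The 24-byte figure can be checked in userspace with stand-in types. This is a sketch that assumes a 64-bit target where pointers are 8 bytes; the list_head and rcu_head definitions below are simplified two-pointer stand-ins rather than the kernel headers, and sheaf_alloc_size() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; each is two pointers wide. */
struct list_head {
	struct list_head *next, *prev;
};

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct slab_sheaf {
	union {
		struct rcu_head rcu_head;
		struct list_head barn_list;
		bool pfmemalloc;
	};
	unsigned short capacity;
	unsigned short size;
	int node;
	void *objects[];
};

/* Header: a two-pointer union, two shorts and an int (24 bytes on LP64). */
static_assert(sizeof(struct slab_sheaf) ==
	      2 * sizeof(void *) + 2 * sizeof(unsigned short) + sizeof(int),
	      "unexpected padding in struct slab_sheaf");
static_assert(offsetof(struct slab_sheaf, objects) == sizeof(struct slab_sheaf),
	      "objects[] should start right after the header");

/* Bytes needed for a sheaf that can hold 'capacity' object pointers. */
static size_t sheaf_alloc_size(unsigned short capacity)
{
	return sizeof(struct slab_sheaf) + (size_t)capacity * sizeof(void *);
}
```

The two shorts and the int pack into 8 bytes after the 16-byte union, so the header needs no padding before objects[].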

Patch 4 allows bootstrap_cache_sheaves() to fail so that it can be
used to re-enable sheaves without panicking the kernel.

Patch 5 splits cache_has_sheaves() into cache_supports_sheaves()
and pcs_has_sheaves().
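
A rough sketch of how the two predicates might differ, assuming a userspace model: the function names come from the description above, but their bodies, the sheaves_supported flag, the two-CPU array and can_use_fastpath() are guesses made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

struct sheaf {
	unsigned short capacity;
	unsigned short size;
};

/* Zero-capacity placeholder installed while sheaves are disabled. */
static struct sheaf bootstrap_sheaf;

struct pcs {
	struct sheaf *main;
};

struct cache {
	bool sheaves_supported;	/* hypothetical flag, fixed at creation */
	struct pcs cpu[2];
};

/* Immutable property: can this cache ever use sheaves? */
static bool cache_supports_sheaves(const struct cache *c)
{
	return c->sheaves_supported;
}

/* Transient state: does this CPU have a real main sheaf right now? */
static bool pcs_has_sheaves(const struct pcs *pcs)
{
	return pcs->main != &bootstrap_sheaf;
}

/* The fastpath is usable only when both answers are yes. */
static bool can_use_fastpath(const struct cache *c, int cpu)
{
	assert(cpu >= 0 && cpu < 2);
	return cache_supports_sheaves(c) && pcs_has_sheaves(&c->cpu[cpu]);
}
```

The first predicate can be read without synchronization because it never changes, while the second only makes sense for a specific CPU at a specific moment.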

Patch 6 enables tuning the sheaf capacity at runtime.

Patch 7 adds lockdep asserts to verify the new rule "Always hold
local_lock when accessing the barn" to make sure there is no sheaf
with stale capacity.

Patch 8 turns MAX_FULL_SHEAVES and MAX_EMPTY_SHEAVES into sysfs
attributes (max_full_sheaves, max_empty_sheaves) and allows tuning.
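
From userspace, tuning would presumably come down to writing a number to the per-cache sysfs directory. The helpers below are a hedged sketch: the function names are hypothetical, the attributes are assumed to live under /sys/kernel/slab/<cache>/ like other SLUB attributes, and the assumption that they accept a plain decimal value is mine, not something stated in the series.

```c
#include <assert.h>
#include <stdio.h>

/* Build "/sys/kernel/slab/<cache>/<attr>"; returns -1 if buf is too small. */
static int slab_attr_path(char *buf, size_t len, const char *cache,
			  const char *attr)
{
	int n;

	assert(buf && cache && attr);
	n = snprintf(buf, len, "/sys/kernel/slab/%s/%s", cache, attr);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write a decimal value to an attribute file; returns 0 on success. */
static int attr_write_uint(const char *path, unsigned int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%u\n", value) > 0;
	if (fclose(f) != 0)	/* write errors may only show up on close */
		ok = 0;
	return ok ? 0 : -1;
}

/* Read a decimal value back from an attribute file; returns 0 on success. */
static int attr_read_uint(const char *path, unsigned int *value)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%u", value) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

For example, slab_attr_path(path, sizeof(path), "kmalloc-64", "sheaf_capacity") followed by attr_write_uint(path, 64) would attempt the update for that cache. Whether the write succeeds depends on the running kernel carrying these patches and on the caller's privileges.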

RFC V1 is also available in git at:
https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=sheaves-tuning-rfc-v1r1

Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
Harry Yoo (Oracle) (8):
      mm/slab: do not store cache pointer in struct slab_sheaf
      mm/slab: change sheaf_capacity type to unsigned short
      mm/slab: track capacity per sheaf
      mm/slab: allow bootstrap_cache_sheaves() to fail
      mm/slab: rework cache_has_sheaves() to check immutable properties only
      mm/slab: allow changing sheaf_capacity at runtime
      mm/slab: add pcs->lock lockdep assert when accessing the barn
      mm/slab: allow changing max_{full,empty}_sheaves at runtime

 include/linux/slab.h         |   8 +-
 mm/slab.h                    |  40 ++-
 mm/slab_common.c             |   2 +-
 mm/slub.c                    | 715 ++++++++++++++++++++++++++++++-------------
 tools/include/linux/slab.h   |  14 +-
 tools/testing/shared/linux.c |   4 +-
 6 files changed, 563 insertions(+), 220 deletions(-)
---
base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
change-id: 20260515-sheaves-tuning-e1f897dc7f5e

Best regards,
--  
Cheers,
Harry / Hyeonggon

Re: [PATCH RFC 0/8] mm/slab: enable runtime sheaves tuning

Posted by Pedro Falcato 6 days, 19 hours ago

On Sat, May 16, 2026 at 01:24:24AM +0900, Harry Yoo (Oracle) wrote:
> Background
> ==========
> 
> Sheaves were introduced in v6.18, and starting from v7.0, they are
> enabled for all slab caches (except for kmem_cache{,_node}). In the
> pre-sheaves era, there was a cpu_partial parameter to tune the number
> of objects cached per CPU. However, sheaves don't have an equivalent
> and the sheaf capacity is determined in the kernel code.

What semantics do you need from this?

> 
> The goal is to allow tuning sheaves at runtime by the next LTS.
> 
> Overview
> ========
> 
> This patchset does two main things:
> 
>   1. Make the sheaf_capacity sysfs attribute writable so that the number
>      of objects cached per CPU can be changed at runtime, and
> 
>   2. Expose MAX_FULL_SHEAVES and MAX_EMPTY_SHEAVES as sysfs attributes
>      rather than constants, so that users can tune them.
> 
> Measuring the performance impact of these tunables is TBD.
> 
> Roughly, the sequence to change sheaf_capacity is as follows:
> 
>   1. Disable sheaves. Make all online CPUs replace their main sheaves
>      with the bootstrap sheaf under local_lock and wait for completion.

This is extremely destabilizing performance-wise, were I to guess.

> 
>   2. Wait for all in-flight RCU callbacks to be processed.

and this too.

>   
>   3. Flush and free all existing sheaves.
> 
>   4. Re-enable sheaves with a new capacity.
> 
> Challenges
> ==========
> 
> 1. Allocations and frees can happen concurrently at any point between
>    these steps, and we cannot introduce heavyweight synchronization
>    mechanisms on the fastpath.
> 
> 2. Currently, cache_has_sheaves() checks whether a cache has sheaves.
>    This works now because sheaves cannot be enabled or disabled once
>    the cache is created.
> 
>    The question "Does this cache has sheaves?" should be split into
>    "Does this cache support sheaves?" and "Does this CPU actually has
>     sheaves enabled right now?".
> 
> 3. Once the sheaf capacity update is complete, no sheaf with stale
>    capacity must remain.

Why? I don't see a huge problem with having multiple sheaves with different
capacities, as long as you adequately, opportunistically kill the sheaves
if they don't have the desired size (say, once a sheaf is fully empty).

-- 
Pedro

Re: [PATCH RFC 0/8] mm/slab: enable runtime sheaves tuning

Posted by Harry Yoo 5 days, 2 hours ago

On 5/18/26 8:52 PM, Pedro Falcato wrote:
> On Sat, May 16, 2026 at 01:24:24AM +0900, Harry Yoo (Oracle) wrote:
>> Background
>> ==========
>>
>> Sheaves were introduced in v6.18, and starting from v7.0, they are
>> enabled for all slab caches (except for kmem_cache{,_node}). In the
>> pre-sheaves era, there was a cpu_partial parameter to tune the number
>> of objects cached per CPU. However, sheaves don't have an equivalent
>> and the sheaf capacity is determined in the kernel code.
> 
> What semantic do you need from this?

The intent is to allow adjusting sheaf capacity to mitigate per-node 
barn / slab list contention on the slowpath (for servers with many 
CPUs), similar to the 'cpu_partial' tunable in SLUB and the 'limit' 
tunable in SLAB.

However, the semantics are slightly different from 'cpu_partial' and 
'limit', as changing sheaf_capacity also affects the number of objects 
cached in the barn.

>> Challenges
>> ==========
>>
>> 1. Allocations and frees can happen concurrently at any point between
>>     these steps, and we cannot introduce heavyweight synchronization
>>     mechanisms on the fastpath.
>>
>> 2. Currently, cache_has_sheaves() checks whether a cache has sheaves.
>>     This works now because sheaves cannot be enabled or disabled once
>>     the cache is created.
>>
>>     The question "Does this cache has sheaves?" should be split into
>>     "Does this cache support sheaves?" and "Does this CPU actually has
>>      sheaves enabled right now?".
>>
>> 3. Once the sheaf capacity update is complete, no sheaf with stale
>>     capacity must remain.
> 
> Why? I don't see a huge problem with having multiple sheaves with different
> capacities, as long as you adequately, opportunistically kill the sheaves
> if they don't have the desired size (say, once a sheaf is fully empty).

Haha, you got me.

Right, enforcing a single capacity at any given point introduced so much
complexity that I started wondering whether it is really essential.

My main concern was that the performance characteristics would become
too unpredictable, but users can actually avoid that by disabling
sheaves, shrinking the cache, and re-enabling them. So that's not
enough of a justification.

When I first started, I was quite cautious and obsessed with the 
invariant because many parts of the current implementation assume "a 
kmem_cache has only a single capacity, and it doesn't change", but 
that's also addressed by this patchset. So that's not a big issue either.

I agree that it is worth trying to allow sheaves of different capacities 
and hopefully that would be less intrusive. Let's see.
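
For reference, the lazy approach Pedro suggests could look roughly like this in a userspace model. It is only a sketch of the idea, not a proposed implementation: sheaf_recycle() and the simplified structs are hypothetical, and locking is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>

struct sheaf {
	unsigned short capacity;
	unsigned short size;
	void *objects[];
};

struct cache {
	unsigned short sheaf_capacity;	/* currently desired capacity */
};

static struct sheaf *sheaf_alloc(unsigned short capacity)
{
	struct sheaf *sheaf = calloc(1, sizeof(*sheaf) + capacity * sizeof(void *));

	if (sheaf)
		sheaf->capacity = capacity;
	return sheaf;
}

/*
 * Called at a point where a sheaf may have become empty. A sheaf with a
 * stale capacity keeps working until it is fully drained, and only then
 * is it replaced by one of the desired size. Returns the sheaf to use
 * from now on.
 */
static struct sheaf *sheaf_recycle(const struct cache *c, struct sheaf *sheaf)
{
	struct sheaf *fresh;

	assert(sheaf != NULL);
	if (sheaf->size != 0 || sheaf->capacity == c->sheaf_capacity)
		return sheaf;

	fresh = sheaf_alloc(c->sheaf_capacity);
	if (!fresh)
		return sheaf;	/* keep the stale one rather than fail */

	free(sheaf);
	return fresh;
}
```

With this scheme a capacity change needs no global flush: sheaves of the old size drain through normal use and are swapped out one at a time, at the cost of mixed capacities for a while.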

Thank you, Pedro.

-- 
Cheers,
Harry / Hyeonggon