[RFC PATCH 01/15] static kmem_cache instances for core caches

Posted by Al Viro 4 weeks, 1 day ago
        kmem_cache_create() and friends create new instances of
struct kmem_cache and return pointers to those.  Quite a few things in
core kernel are allocated from such caches; each allocation involves
dereferencing an assign-once pointer and for sufficiently hot ones that
dereferencing does show in profiles.

        There had been patches floating around switching some of those
to runtime_const infrastructure.  Unfortunately, it's arch-specific
and most of the architectures lack it.

        There's an alternative approach applicable at least to the caches
that are never destroyed, which covers a lot of them.  No matter what,
runtime_const for pointers is not going to be faster than plain &,
so if we had struct kmem_cache instances with static storage duration, we
would be at least no worse off than we are with runtime_const variants.

        There are obstacles to doing that, but they turn out to be easy
to deal with.

1) as it is, struct kmem_cache is opaque for anything outside of a few
files in mm/*; that avoids serious headache with header dependencies,
etc., and it's not something we want to lose.  Solution: struct
kmem_cache_opaque, with the size and alignment identical to struct
kmem_cache.  Calculation of size and alignment can be done via the same
mechanism we use for asm-offsets.h and rq-offsets.h, with build-time
check for mismatches.  With that done, we get an opaque type defined in
linux/slab-static.h that can be used for declaring those caches.
In linux/slab.h we add a forward declaration of kmem_cache_opaque +
helper (to_kmem_cache()) converting a pointer to kmem_cache_opaque
into pointer to kmem_cache.

2) real constructor of kmem_cache needs to be taught to deal with
preallocated instances.  That turns out to be easy - we already pass an
obscene amount of optional arguments via struct kmem_cache_args, so we
can stash the pointer to preallocated instance in there.  Changes in
mm/slab_common.c are very minor - we should treat preallocated caches
as unmergeable, use the instance passed to us instead of allocating a
new one and we should not free them.  That's it.
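
A userspace model of the constructor change described in 2) may help
when reading the mm/slab_common.c hunk.  It is only a sketch: struct
cache, struct cache_args, cache_setup() and cache_create() are
hypothetical stand-ins, the real code returns ERR_PTR() values rather
than an int plus an out-parameter, and the no_merge field only models
the effect of SLAB_NO_MERGE.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kmem_cache. */
struct cache {
	const char *name;
	unsigned int object_size;
	int refcount;
	bool no_merge;
};

/* Hypothetical stand-in for struct kmem_cache_args. */
struct cache_args {
	unsigned int align;
	struct cache *preallocated;	/* NULL means "allocate one for me" */
};

/* Stand-in for do_kmem_cache_create(); rejects a zero object size. */
static int cache_setup(struct cache *s, const char *name, unsigned int size)
{
	if (!size)
		return -EINVAL;
	s->name = name;
	s->object_size = size;
	return 0;
}

/* Returns 0 and stores the cache in *out, or a negative errno. */
static int cache_create(const char *name, unsigned int size,
			const struct cache_args *args, struct cache **out)
{
	struct cache *s = args->preallocated;
	int err;

	assert(out != NULL);

	/* Allocate only when the caller did not supply storage. */
	if (!s) {
		s = calloc(1, sizeof(*s));
		if (!s)
			return -ENOMEM;
	}

	err = cache_setup(s, name, size);
	if (err) {
		/* Never free storage we did not allocate. */
		if (!args->preallocated)
			free(s);
		return err;
	}

	/* Preallocated instances are treated as unmergeable. */
	s->no_merge = args->preallocated != NULL;
	s->refcount = 1;
	*out = s;
	return 0;
}
```

The point of the shape is that the preallocated case differs from the
dynamic one in exactly three places: where the instance comes from,
whether it is freed on failure, and whether it may be merged.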

	Note that slab-static.h is needed only in places that create
such instances; all users need only slab.h (and they can be modular,
unlike runtime_const-based approach).
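
For illustration, the opaque-type trick can be modelled in userspace
roughly as follows.  This is a sketch under assumptions, not the patch
itself: struct cache, struct cache_opaque, to_cache() and the other
names are hypothetical, the size and alignment are computed directly
instead of coming from a generated header, and the cast relies on the
compiler not applying strict-aliasing optimisations to it (the kernel
builds with -fno-strict-aliasing).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Private definition; stands in for struct kmem_cache. */
struct cache {
	const char *name;
	unsigned int object_size;
	long refcount;
};

/*
 * In the patch these constants come from a generated header so that
 * users never see the private definition.  Computing them in place is
 * a simplification for this model.
 */
#define CACHE_SIZE	sizeof(struct cache)
#define CACHE_ALIGN	alignof(struct cache)

/* Same size and alignment as struct cache, but no visible members. */
struct cache_opaque {
	alignas(CACHE_ALIGN) unsigned char opaque[CACHE_SIZE];
};

/* Build-time check, analogous to the BUILD_BUG_ON()s in the patch. */
static_assert(sizeof(struct cache_opaque) == sizeof(struct cache),
	      "opaque size mismatch");
static_assert(alignof(struct cache_opaque) == alignof(struct cache),
	      "opaque alignment mismatch");

/* The only thing ordinary users need: a cast helper. */
static inline struct cache *to_cache(struct cache_opaque *p)
{
	return (struct cache *)p;
}

/* Storage with static duration; its address is a link-time constant. */
static struct cache_opaque demo_cache_store;

/* Allocator-side code sees the real layout. */
static inline void cache_init(struct cache *s, const char *name,
			      unsigned int object_size)
{
	s->name = name;
	s->object_size = object_size;
	s->refcount = 1;
}

static inline unsigned int cache_object_size(const struct cache *s)
{
	return s->object_size;
}
```

Only the file that defines demo_cache_store needs the complete opaque
type; everything else can get by with a forward declaration and
to_cache().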

	That covers the instances that never get destroyed.  Quite a few
fall into that category, but there's a major exception - anything in
modules must be destroyed before the module gets removed.  For example,
filesystems that have their inodes allocated from a private kmem_cache
can't make use of that technique for their inode allocations, etc.

	It's not that hard to deal with, but for now let's just ban
including slab-static.h from modules.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
 Kbuild                      | 13 +++++++-
 include/linux/slab-static.h | 65 +++++++++++++++++++++++++++++++++++++
 include/linux/slab.h        |  7 ++++
 mm/kmem_cache_size.c        | 20 ++++++++++++
 mm/slab_common.c            | 30 ++++++++---------
 mm/slub.c                   |  7 ++++
 6 files changed, 126 insertions(+), 16 deletions(-)
 create mode 100644 include/linux/slab-static.h
 create mode 100644 mm/kmem_cache_size.c

diff --git a/Kbuild b/Kbuild
index 13324b4bbe23..eb985a6614eb 100644
--- a/Kbuild
+++ b/Kbuild
@@ -45,13 +45,24 @@ kernel/sched/rq-offsets.s: $(offsets-file)
 $(rq-offsets-file): kernel/sched/rq-offsets.s FORCE
 	$(call filechk,offsets,__RQ_OFFSETS_H__)
 
+# generate kmem_cache_size.h
+
+kmem_cache_size-file := include/generated/kmem_cache_size.h
+
+targets += mm/kmem_cache_size.s
+
+mm/kmem_cache_size.s: $(rq-offsets-file)
+
+$(kmem_cache_size-file): mm/kmem_cache_size.s FORCE
+	$(call filechk,offsets,__KMEM_CACHE_SIZE_H__)
+
 # Check for missing system calls
 
 quiet_cmd_syscalls = CALL    $<
       cmd_syscalls = $(CONFIG_SHELL) $< $(CC) $(c_flags) $(missing_syscalls_flags)
 
 PHONY += missing-syscalls
-missing-syscalls: scripts/checksyscalls.sh $(rq-offsets-file)
+missing-syscalls: scripts/checksyscalls.sh $(kmem_cache_size-file)
 	$(call cmd,syscalls)
 
 # Check the manual modification of atomic headers
diff --git a/include/linux/slab-static.h b/include/linux/slab-static.h
new file mode 100644
index 000000000000..47b2220b4988
--- /dev/null
+++ b/include/linux/slab-static.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_SLAB_STATIC_H
+#define _LINUX_SLAB_STATIC_H
+
+#ifdef MODULE
+#error "can't use that in modules"
+#endif
+
+#include <generated/kmem_cache_size.h>
+
+/* same size and alignment as struct kmem_cache: */
+struct kmem_cache_opaque {
+	unsigned char opaque[KMEM_CACHE_SIZE];
+} __aligned(KMEM_CACHE_ALIGN);
+
+#define __KMEM_CACHE_SETUP(cache, name, size, flags, ...)	\
+		__kmem_cache_create_args((name), (size),	\
+			&(struct kmem_cache_args) {		\
+				.preallocated = (cache),	\
+				__VA_ARGS__}, (flags))
+
+static inline int
+kmem_cache_setup_usercopy(struct kmem_cache *s,
+			  const char *name, unsigned int size,
+			  unsigned int align, slab_flags_t flags,
+			  unsigned int useroffset, unsigned int usersize,
+			  void (*ctor)(void *))
+{
+	struct kmem_cache *res;
+	res = __KMEM_CACHE_SETUP(s, name, size, flags,
+				.align		= align,
+				.ctor		= ctor,
+				.useroffset	= useroffset,
+				.usersize	= usersize);
+	if (IS_ERR(res))
+		return PTR_ERR(res);
+	return 0;
+}
+
+static inline int
+kmem_cache_setup(struct kmem_cache *s,
+		 const char *name, unsigned int size,
+		 unsigned int align, slab_flags_t flags,
+		 void (*ctor)(void *))
+{
+	struct kmem_cache *res;
+	res = __KMEM_CACHE_SETUP(s, name, size, flags,
+				.align		= align,
+				.ctor		= ctor);
+	if (IS_ERR(res))
+		return PTR_ERR(res);
+	return 0;
+}
+
+#define KMEM_CACHE_SETUP(s, __struct, __flags)                          	\
+	__KMEM_CACHE_SETUP((s), #__struct, sizeof(struct __struct), (__flags),	\
+			.align	= __alignof__(struct __struct))
+
+#define KMEM_CACHE_SETUP_USERCOPY(s, __struct, __flags, __field)		\
+	__KMEM_CACHE_SETUP((s), #__struct, sizeof(struct __struct), (__flags),	\
+			.align	= __alignof__(struct __struct),			\
+			.useroffset = offsetof(struct __struct, __field),	\
+			.usersize = sizeof_field(struct __struct, __field))
+
+#endif
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2482992248dc..f16c784148b4 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -261,11 +261,17 @@ enum _slab_flag_bits {
 
 struct list_lru;
 struct mem_cgroup;
+struct kmem_cache_opaque;
 /*
  * struct kmem_cache related prototypes
  */
 bool slab_is_available(void);
 
+static inline struct kmem_cache *to_kmem_cache(struct kmem_cache_opaque *p)
+{
+	return (struct kmem_cache *)p;
+}
+
 /**
  * struct kmem_cache_args - Less common arguments for kmem_cache_create()
  *
@@ -366,6 +372,7 @@ struct kmem_cache_args {
 	 * %0 means no sheaves will be created.
 	 */
 	unsigned int sheaf_capacity;
+	struct kmem_cache *preallocated;
 };
 
 struct kmem_cache *__kmem_cache_create_args(const char *name,
diff --git a/mm/kmem_cache_size.c b/mm/kmem_cache_size.c
new file mode 100644
index 000000000000..1ddbfa41a507
--- /dev/null
+++ b/mm/kmem_cache_size.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Generate definitions needed by the preprocessor.
+ * This code generates raw asm output which is post-processed
+ * to extract and format the required data.
+ */
+
+#define COMPILE_OFFSETS
+#include <linux/kbuild.h>
+#include "slab.h"
+
+int main(void)
+{
+	/* The constants to put into include/generated/kmem_cache_size.h */
+	DEFINE(KMEM_CACHE_SIZE, sizeof(struct kmem_cache));
+	DEFINE(KMEM_CACHE_ALIGN, __alignof(struct kmem_cache));
+	/* End of constants */
+
+	return 0;
+}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index eed7ea556cb1..81a413b44afb 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -224,33 +224,30 @@ static struct kmem_cache *create_cache(const char *name,
 				       struct kmem_cache_args *args,
 				       slab_flags_t flags)
 {
-	struct kmem_cache *s;
+	struct kmem_cache *s = args->preallocated;
 	int err;
 
 	/* If a custom freelist pointer is requested make sure it's sane. */
-	err = -EINVAL;
 	if (args->use_freeptr_offset &&
 	    (args->freeptr_offset >= object_size ||
 	     !(flags & SLAB_TYPESAFE_BY_RCU) ||
 	     !IS_ALIGNED(args->freeptr_offset, __alignof__(freeptr_t))))
-		goto out;
+		return ERR_PTR(-EINVAL);
 
-	err = -ENOMEM;
-	s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
-	if (!s)
-		goto out;
+	if (!s) {
+		s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
+		if (!s)
+			return ERR_PTR(-ENOMEM);
+	}
 	err = do_kmem_cache_create(s, name, object_size, args, flags);
-	if (err)
-		goto out_free_cache;
-
+	if (unlikely(err)) {
+		if (!args->preallocated)
+			kmem_cache_free(kmem_cache, s);
+		return ERR_PTR(err);
+	}
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
 	return s;
-
-out_free_cache:
-	kmem_cache_free(kmem_cache, s);
-out:
-	return ERR_PTR(err);
 }
 
 /**
@@ -324,6 +321,9 @@ struct kmem_cache *__kmem_cache_create_args(const char *name,
 		    object_size - args->usersize < args->useroffset))
 		args->usersize = args->useroffset = 0;
 
+	if (args->preallocated)
+		flags |= SLAB_NO_MERGE;
+
 	if (!args->usersize && !args->sheaf_capacity)
 		s = __kmem_cache_alias(name, object_size, args->align, flags,
 				       args->ctor);
diff --git a/mm/slub.c b/mm/slub.c
index 861592ac5425..41fe79b3f055 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -47,6 +47,7 @@
 #include <linux/irq_work.h>
 #include <linux/kprobes.h>
 #include <linux/debugfs.h>
+#include <linux/slab-static.h>
 #include <trace/events/kmem.h>
 
 #include "internal.h"
@@ -8491,6 +8492,12 @@ void __init kmem_cache_init(void)
 		boot_kmem_cache_node;
 	int node;
 
+	/* verify that kmem_cache_opaque is correct */
+	BUILD_BUG_ON(sizeof(struct kmem_cache) !=
+		     sizeof(struct kmem_cache_opaque));
+	BUILD_BUG_ON(__alignof(struct kmem_cache) !=
+		     __alignof(struct kmem_cache_opaque));
+
 	if (debug_guardpage_minorder())
 		slub_max_order = 0;
 
-- 
2.47.3
Re: [RFC PATCH 01/15] static kmem_cache instances for core caches
Posted by Harry Yoo 3 weeks, 4 days ago
On Sat, Jan 10, 2026 at 04:02:03AM +0000, Al Viro wrote:
>         kmem_cache_create() and friends create new instances of
> struct kmem_cache and return pointers to those.  Quite a few things in
> core kernel are allocated from such caches; each allocation involves
> dereferencing an assign-once pointer and for sufficiently hot ones that
> dereferencing does show in profiles.
> 
>         There had been patches floating around switching some of those
> to runtime_const infrastructure.  Unfortunately, it's arch-specific
> and most of the architectures lack it.
> 
>         There's an alternative approach applicable at least to the caches
> that are never destroyed, which covers a lot of them.  No matter what,
> runtime_const for pointers is not going to be faster than plain &,
> so if we had struct kmem_cache instances with static storage duration, we
> would be at least no worse off than we are with runtime_const variants.
> 
>         There are obstacles to doing that, but they turn out to be easy
> to deal with.
> 
> 1) as it is, struct kmem_cache is opaque for anything outside of a few
> files in mm/*; that avoids serious headache with header dependencies,
> etc., and it's not something we want to lose.  Solution: struct
> kmem_cache_opaque, with the size and alignment identical to struct
> kmem_cache.  Calculation of size and alignment can be done via the same
> mechanism we use for asm-offsets.h and rq-offsets.h, with build-time
> check for mismatches.  With that done, we get an opaque type defined in
> linux/slab-static.h that can be used for declaring those caches.
> In linux/slab.h we add a forward declaration of kmem_cache_opaque +
> helper (to_kmem_cache()) converting a pointer to kmem_cache_opaque
> into pointer to kmem_cache.
> 
> 2) real constructor of kmem_cache needs to be taught to deal with
> preallocated instances.  That turns out to be easy - we already pass an
> obscene amount of optional arguments via struct kmem_cache_args, so we
> can stash the pointer to preallocated instance in there.  Changes in
> mm/slab_common.c are very minor - we should treat preallocated caches
> as unmergeable, use the instance passed to us instead of allocating a
> new one and we should not free them.  That's it.

SLAB_NO_MERGE prevents both sides of merging - when 1) creating the cache,
and when 2) another cache tries to create an alias from it.

Avoiding 1) makes sense, but is there a reason to prevent 2)?

If it's fine for other caches to merge into a cache with static
duration, then it's sufficient to update find_mergeable() to not attempt
creating an alias during cache creation if args->preallocated is
specified (instead of using SLAB_NO_MERGE).

-- 
Cheers,
Harry / Hyeonggon

> 	Note that slab-static.h is needed only in places that create
> such instances; all users need only slab.h (and they can be modular,
> unlike runtime_const-based approach).
> 
> 	That covers the instances that never get destroyed.  Quite a few
> fall into that category, but there's a major exception - anything in
> modules must be destroyed before the module gets removed.  For example,
> filesystems that have their inodes allocated from a private kmem_cache
> can't make use of that technique for their inode allocations, etc.
> 
> 	It's not that hard to deal with, but for now let's just ban
> including slab-static.h from modules.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Re: [RFC PATCH 01/15] static kmem_cache instances for core caches
Posted by Vlastimil Babka 3 weeks, 3 days ago
On 1/14/26 08:30, Harry Yoo wrote:
> On Sat, Jan 10, 2026 at 04:02:03AM +0000, Al Viro wrote:
>>         kmem_cache_create() and friends create new instances of
>> struct kmem_cache and return pointers to those.  Quite a few things in
>> core kernel are allocated from such caches; each allocation involves
>> dereferencing an assign-once pointer and for sufficiently hot ones that
>> dereferencing does show in profiles.
>> 
>>         There had been patches floating around switching some of those
>> to runtime_const infrastructure.  Unfortunately, it's arch-specific
>> and most of the architectures lack it.
>> 
>>         There's an alternative approach applicable at least to the caches
>> that are never destroyed, which covers a lot of them.  No matter what,
>> runtime_const for pointers is not going to be faster than plain &,
>> so if we had struct kmem_cache instances with static storage duration, we
>> would be at least no worse off than we are with runtime_const variants.
>> 
>>         There are obstacles to doing that, but they turn out to be easy
>> to deal with.
>> 
>> 1) as it is, struct kmem_cache is opaque for anything outside of a few
>> files in mm/*; that avoids serious headache with header dependencies,
>> etc., and it's not something we want to lose.  Solution: struct
>> kmem_cache_opaque, with the size and alignment identical to struct
>> kmem_cache.  Calculation of size and alignment can be done via the same
>> mechanism we use for asm-offsets.h and rq-offsets.h, with build-time
>> check for mismatches.  With that done, we get an opaque type defined in
>> linux/slab-static.h that can be used for declaring those caches.
>> In linux/slab.h we add a forward declaration of kmem_cache_opaque +
>> helper (to_kmem_cache()) converting a pointer to kmem_cache_opaque
>> into pointer to kmem_cache.
>> 
>> 2) real constructor of kmem_cache needs to be taught to deal with
>> preallocated instances.  That turns out to be easy - we already pass an
>> obscene amount of optional arguments via struct kmem_cache_args, so we
>> can stash the pointer to preallocated instance in there.  Changes in
>> mm/slab_common.c are very minor - we should treat preallocated caches
>> as unmergeable, use the instance passed to us instead of allocating a
>> new one and we should not free them.  That's it.
> 
> SLAB_NO_MERGE prevents both sides of merging - when 1) creating the cache,
> and when 2) another cache tries to create an alias from it.
> 
> Avoiding 1) makes sense, but is there a reason to prevent 2)?
> 
> If it's fine for other caches to merge into a cache with static
> duration, then it's sufficient to update find_mergeable() to not attempt
> creating an alias during cache creation if args->preallocated is
> specified (instead of using SLAB_NO_MERGE).

The merging prevention is my biggest concern with the approach. We could
potentially solve it by moving the sharing to a different layer than today's
sharing of kmem_cache objects with refcount, and instead have separate
instances that point to the same underlying storage (mainly the per-node and
per-cpu slabs/sheaves). It's possible it would also simplify the suboptimal
sysfs handling of today as the aliases could know their cache name and own
their symlinks.

However slabs and sheaves do have a parent kmem_cache pointer. It's how e.g.
kfree() works by virt_to_slab(obj) -> kmem_cache and then being like
kmem_cache_free().

So we could have kmem_cache->primary_cache field where the primary would
just point to self and aliasing caches to the primary, and newly created
slabs and sheaves would read that ->primary_cache to assign their kmem_cache
pointer. This is not a fastpath operation so it shouldn't matter, and with
that there wouldn't be any mix of differing cache pointers so the aliases
could be destroyed easily. And then the primary cache wouldn't be able to go
away as long as there are aliases, as it is today.

Only a dynamic cache or a non-module static cache thus could become a
primary, for module unload reasons.

For this to work fully mergeable in all scenarios of the order of creating
static vs dynamic aliases, there would however have to be a weird quirk for
static module caches - when such a cache is created, and there's no
compatible primary to become alias of, a dynamic, otherwise unused primary
would need to be created just to become the owner of the slabs and sheaves.
Because if a mergeable dynamic cache appears later, it would not be able to
become a primary for the static module cache to become alias of, because the
static module cache would already have existing slabs and sheaves pointing
to it.

And there might be other issues with this scheme I don't immediately see.
But maybe it's feasible.
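
To make the ->primary_cache idea above concrete, here is a small
userspace model.  It is a sketch of the scheme as described, not
existing kernel code: struct cache, struct slab, nr_aliases and all the
helpers are hypothetical names, and locking and the module quirk are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kmem_cache. */
struct cache {
	const char *name;
	struct cache *primary;	/* self for a primary, the owner for an alias */
	int nr_aliases;		/* aliases currently pointing at this primary */
};

/* Hypothetical stand-in for a slab with its parent cache pointer. */
struct slab {
	struct cache *owner;
};

static void cache_init_primary(struct cache *s, const char *name)
{
	s->name = name;
	s->primary = s;
	s->nr_aliases = 0;
}

static void cache_init_alias(struct cache *s, const char *name,
			     struct cache *primary)
{
	/* Only a real primary may be aliased. */
	assert(primary->primary == primary);
	s->name = name;
	s->primary = primary;
	s->nr_aliases = 0;
	primary->nr_aliases++;
}

/* New slabs always record the primary, whichever instance asked. */
static void slab_init(struct slab *slab, const struct cache *s)
{
	slab->owner = s->primary;
}

/* Returns false if the cache cannot be destroyed yet. */
static bool cache_destroy(struct cache *s)
{
	if (s->primary != s) {
		/* An alias owns no slabs, so it can always go away. */
		s->primary->nr_aliases--;
		s->primary = NULL;
		return true;
	}
	/* A primary has to outlive its aliases. */
	return s->nr_aliases == 0;
}
```

Since slab_init() never stores an alias pointer, no slab ever refers to
an instance that might be torn down early, which is what would make
alias destruction cheap in this scheme.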
Re: [RFC PATCH 01/15] static kmem_cache instances for core caches
Posted by Al Viro 3 weeks, 4 days ago
On Wed, Jan 14, 2026 at 04:30:24PM +0900, Harry Yoo wrote:

> SLAB_NO_MERGE prevents both sides of merging - when 1) creating the cache,
> and when 2) another cache tries to create an alias from it.
> 
> Avoiding 1) makes sense, but is there a reason to prevent 2)?
> 
> If it's fine for other caches to merge into a cache with static
> duration, then it's sufficient to update find_mergeable() to not attempt
> creating an alias during cache creation if args->preallocated is
> specified (instead of using SLAB_NO_MERGE).

Umm...  For static-in-module - definitely (what if it goes away before
the dynamic alias?), for globally static... might be fine, I guess...
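
The two sides of merging discussed here could be kept apart along the
following lines.  This is a userspace sketch only: enum cache_kind,
struct cache and the helpers are hypothetical, matching is reduced to
object size, and "globally static caches may accept aliases" is an
assumption taken from the tentative answer above, not a settled rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of where a cache instance lives. */
enum cache_kind {
	CACHE_DYNAMIC,		/* allocated by the slab code itself */
	CACHE_STATIC_BUILTIN,	/* static storage in the kernel image */
	CACHE_STATIC_MODULE,	/* static storage inside a module */
};

/* Hypothetical stand-in for struct kmem_cache. */
struct cache {
	unsigned int object_size;
	enum cache_kind kind;
};

/* Side 1: may a cache being created become an alias of another one? */
static bool may_become_alias(enum cache_kind new_kind)
{
	/* A preallocated instance must use its own storage. */
	return new_kind == CACHE_DYNAMIC;
}

/* Side 2: may an existing cache accept aliases from later creations? */
static bool may_accept_alias(const struct cache *s)
{
	/* A module's static cache can vanish before its aliases do. */
	return s->kind != CACHE_STATIC_MODULE;
}

/* Returns a cache to alias, or NULL if a new one must be set up. */
static struct cache *find_alias_target(struct cache *caches, size_t n,
				       unsigned int object_size,
				       enum cache_kind new_kind)
{
	size_t i;

	if (!may_become_alias(new_kind))
		return NULL;

	for (i = 0; i < n; i++) {
		if (may_accept_alias(&caches[i]) &&
		    caches[i].object_size == object_size)
			return &caches[i];
	}
	return NULL;
}
```

With a single flag covering both sides, may_accept_alias() would also
return false for CACHE_STATIC_BUILTIN; splitting the checks is what
would let dynamic caches merge into a globally static one.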
Re: [RFC PATCH 01/15] static kmem_cache instances for core caches
Posted by Matthew Wilcox 4 weeks, 1 day ago
On Sat, Jan 10, 2026 at 04:02:03AM +0000, Al Viro wrote:
> +++ b/Kbuild
> @@ -45,13 +45,24 @@ kernel/sched/rq-offsets.s: $(offsets-file)
>  $(rq-offsets-file): kernel/sched/rq-offsets.s FORCE
>  	$(call filechk,offsets,__RQ_OFFSETS_H__)
>  
> +# generate kmem_cache_size.h
> +
> +kmem_cache_size-file := include/generated/kmem_cache_size.h
> +
> +targets += mm/kmem_cache_size.s
> +
> +mm/kmem_cache_size.s: $(rq-offsets-file)
> +
> +$(kmem_cache_size-file): mm/kmem_cache_size.s FORCE
> +	$(call filechk,offsets,__KMEM_CACHE_SIZE_H__)
> +
>  # Check for missing system calls
>  
>  quiet_cmd_syscalls = CALL    $<
>        cmd_syscalls = $(CONFIG_SHELL) $< $(CC) $(c_flags) $(missing_syscalls_flags)
>  
>  PHONY += missing-syscalls
> -missing-syscalls: scripts/checksyscalls.sh $(rq-offsets-file)
> +missing-syscalls: scripts/checksyscalls.sh $(kmem_cache_size-file)
>  	$(call cmd,syscalls)

Did you mean to _replace_  rq-offsets-file rather than add
kmem_cache_size-file ?

(I also wonder if we want to just do slab or if we want to make this
mm-offsets.h and maybe put other things in it later, but I'm having
trouble thinking of other things we might want to generate)
Re: [RFC PATCH 01/15] static kmem_cache instances for core caches
Posted by Al Viro 4 weeks, 1 day ago
On Sat, Jan 10, 2026 at 05:40:34AM +0000, Matthew Wilcox wrote:
> On Sat, Jan 10, 2026 at 04:02:03AM +0000, Al Viro wrote:
> > +++ b/Kbuild
> > @@ -45,13 +45,24 @@ kernel/sched/rq-offsets.s: $(offsets-file)
> >  $(rq-offsets-file): kernel/sched/rq-offsets.s FORCE
> >  	$(call filechk,offsets,__RQ_OFFSETS_H__)
> >  
> > +# generate kmem_cache_size.h
> > +
> > +kmem_cache_size-file := include/generated/kmem_cache_size.h
> > +
> > +targets += mm/kmem_cache_size.s
> > +
> > +mm/kmem_cache_size.s: $(rq-offsets-file)
> > +
> > +$(kmem_cache_size-file): mm/kmem_cache_size.s FORCE
> > +	$(call filechk,offsets,__KMEM_CACHE_SIZE_H__)
> > +
> >  # Check for missing system calls
> >  
> >  quiet_cmd_syscalls = CALL    $<
> >        cmd_syscalls = $(CONFIG_SHELL) $< $(CC) $(c_flags) $(missing_syscalls_flags)
> >  
> >  PHONY += missing-syscalls
> > -missing-syscalls: scripts/checksyscalls.sh $(rq-offsets-file)
> > +missing-syscalls: scripts/checksyscalls.sh $(kmem_cache_size-file)
> >  	$(call cmd,syscalls)
> 
> Did you mean to _replace_  rq-offsets-file rather than add
> kmem_cache_size-file ?

Insert kmem_cache_size-file into the chain, actually.  At the moment, mainline has
$(bounds-file): kernel/bounds.s FORCE
        $(call filechk,offsets,__LINUX_BOUNDS_H__)

$(timeconst-file): kernel/time/timeconst.bc FORCE
        $(call filechk,gentimeconst)

arch/$(SRCARCH)/kernel/asm-offsets.s: $(timeconst-file) $(bounds-file)

$(offsets-file): arch/$(SRCARCH)/kernel/asm-offsets.s FORCE
        $(call filechk,offsets,__ASM_OFFSETS_H__)

kernel/sched/rq-offsets.s: $(offsets-file)

$(rq-offsets-file): kernel/sched/rq-offsets.s FORCE
        $(call filechk,offsets,__RQ_OFFSETS_H__)

missing-syscalls: scripts/checksyscalls.sh $(rq-offsets-file)
        $(call cmd,syscalls)

with prepare having deps on $(offsets-file) and missing-syscalls, which
orders the entire sequence.