[v2] mm/slub: allocate sheaves on local memory nodes

[PATCH v2] mm/slub: allocate sheaves on local memory nodes

Posted by Hao Li 6 days, 19 hours ago

Sheaf structs are exchanged through node-local barns. Since barn structs
are already allocated from their local NUMA node, this patch aims to
allocate sheaf structs from their local memory nodes as well.

To achieve this, the obvious choice would be using cpu_to_mem().
However, init_percpu_sheaves() and bootstrap_cache_sheaves() iterate
through possible CPUs, whereas cpu_to_mem() is only initialized for
online CPUs. Therefore, we cannot use cpu_to_mem() and instead need to
use local_memory_node(cpu_to_node(cpu)), similar to what
__build_all_zonelists() does.

The primary goal of this patch is to improve NUMA node locality.
Although the actual performance impact is minor, it still yields a ~1%
improvement on a 192-core, 8-NUMA-node system when testing with the
will-it-scale mmap test case.

Signed-off-by: Hao Li <hao.li@linux.dev>
---
Changes in v2:
- Make init_percpu_sheaves() use a NUMA-aware sheaf struct allocation too.
  (Thanks Harry)
- Rebase on latest code.

v1: https://lore.kernel.org/linux-mm/20260525082312.16012-1-hao.li@linux.dev/
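
For readers following along, the node lookup described in the commit
message can be modelled in userspace roughly as below. This is only a
sketch under stated assumptions, not kernel code: every model_* name, the
3-node topology, the "nearest node" table, and the MODEL_NODE_UNSET
sentinel are hypothetical and exist only for illustration. The per-CPU
cache stands in for what cpu_to_mem() reads, and it is filled for online
CPUs only.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS		4
#define MODEL_NR_NODES		3
#define MODEL_NODE_UNSET	(-1)	/* modelling choice, not a kernel value */

/* Hypothetical topology: node 1 is memoryless, its nearest memory is node 2. */
static const bool model_node_has_memory[MODEL_NR_NODES] = { true, false, true };
static const int model_nearest_mem_node[MODEL_NR_NODES] = { 0, 2, 2 };
static const int model_cpu_node[MODEL_NR_CPUS] = { 0, 1, 1, 2 };
static const bool model_cpu_online[MODEL_NR_CPUS] = { true, true, false, false };

/* Stand-in for the per-CPU value that cpu_to_mem() would read. */
static int model_cached_mem_node[MODEL_NR_CPUS];

/* Stand-in for local_memory_node(): map a node to a node that has memory. */
int model_local_memory_node(int node)
{
	assert(node >= 0 && node < MODEL_NR_NODES);
	return model_node_has_memory[node] ? node : model_nearest_mem_node[node];
}

/* Fill the cache for online CPUs only; offline CPUs stay unset. */
void model_build_cache(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (model_cpu_online[cpu])
			model_cached_mem_node[cpu] =
				model_local_memory_node(model_cpu_node[cpu]);
		else
			model_cached_mem_node[cpu] = MODEL_NODE_UNSET;
	}
}

/* Cached lookup: only meaningful for CPUs that were online at build time. */
int model_cpu_to_mem(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return model_cached_mem_node[cpu];
}

/* Direct lookup: works for any possible CPU, online or not. */
int model_sheaf_node(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return model_local_memory_node(model_cpu_node[cpu]);
}
```

In this model, a loop over all possible CPUs gets a usable node from
model_sheaf_node() even for the offline CPUs 2 and 3, while the cached
lookup has nothing for them.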

---
 mm/slub.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index cbf6636a3dad..7d36e09ae216 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2757,7 +2757,7 @@ static inline void *setup_object(struct kmem_cache *s, void *object)
 }
 
 static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
-					      unsigned int capacity)
+					      unsigned int capacity, int node)
 {
 	struct slab_sheaf *sheaf;
 	size_t sheaf_size;
@@ -2771,7 +2771,7 @@ static struct slab_sheaf *__alloc_empty_sheaf(struct kmem_cache *s, gfp_t gfp,
 		gfp |= __GFP_NO_OBJ_EXT;
 
 	sheaf_size = struct_size(sheaf, objects, capacity);
-	sheaf = kzalloc(sheaf_size, gfp);
+	sheaf = kzalloc_node(sheaf_size, gfp, node);
 
 	if (unlikely(!sheaf))
 		return NULL;
@@ -2791,7 +2791,7 @@ static inline struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
 
 	gfp &= ~OBJCGS_CLEAR_MASK;
 
-	return __alloc_empty_sheaf(s, gfp, s->sheaf_capacity);
+	return __alloc_empty_sheaf(s, gfp, s->sheaf_capacity, numa_mem_id());
 }
 
 static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
@@ -5014,7 +5014,7 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
 
 	if (unlikely(size > s->sheaf_capacity)) {
 
-		sheaf = __alloc_empty_sheaf(s, gfp, size);
+		sheaf = __alloc_empty_sheaf(s, gfp, size, numa_mem_id());
 		if (!sheaf)
 			return NULL;
 
@@ -7575,6 +7575,7 @@ static int init_percpu_sheaves(struct kmem_cache *s)
 
 	for_each_possible_cpu(cpu) {
 		struct slub_percpu_sheaves *pcs;
+		int mem_node;
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
@@ -7598,10 +7599,13 @@ static int init_percpu_sheaves(struct kmem_cache *s)
 		 * For kmalloc caches it's used temporarily during the initial
 		 * bootstrap.
 		 */
-		if (!s->sheaf_capacity)
+		if (!s->sheaf_capacity) {
 			pcs->main = &bootstrap_sheaf;
-		else
-			pcs->main = alloc_empty_sheaf(s, GFP_KERNEL);
+		} else {
+			mem_node = local_memory_node(cpu_to_node(cpu));
+			pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL,
+					s->sheaf_capacity, mem_node);
+		}
 
 		if (!pcs->main)
 			return -ENOMEM;
@@ -8465,10 +8469,17 @@ static void __init bootstrap_cache_sheaves(struct kmem_cache *s)
 
 	for_each_possible_cpu(cpu) {
 		struct slub_percpu_sheaves *pcs;
+		int mem_node;
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL, capacity);
+		/*
+		 * Cannot use cpu_to_mem() here because it's only initialized
+		 * for online CPUs at this point (see __build_all_zonelists),
+		 * while we need to allocate sheaves for all possible CPUs.
+		 */
+		mem_node = local_memory_node(cpu_to_node(cpu));
+		pcs->main = __alloc_empty_sheaf(s, GFP_KERNEL, capacity, mem_node);
 
 		if (!pcs->main) {
 			failed = true;
-- 
2.54.0

Re: [PATCH v2] mm/slub: allocate sheaves on local memory nodes

Posted by Harry Yoo 6 days, 18 hours ago

On 6/1/26 6:56 PM, Hao Li wrote:
> Sheaf structs are exchanged through node-local barns. Since barn structs
> are already allocated from their local NUMA node, this patch aims to
> allocate sheaf structs from their local memory nodes as well.
> 
> To achieve this, the obvious choice would be using cpu_to_mem().
> However, init_percpu_sheaves() and bootstrap_cache_sheaves() iterate
> through possible CPUs, whereas cpu_to_mem() is only initialized for
> online CPUs. Therefore, we cannot use cpu_to_mem() and instead need to
> use local_memory_node(cpu_to_node(cpu)), similar to what
> __build_all_zonelists() does.
> 
> The primary goal of this patch is to improve NUMA node locality.
> Although the actual performance impact is minor, it still yields a ~1%
> improvement on a 192-core, 8-NUMA-node system when testing with the
> will-it-scale mmap test case.

Oh, nice :)

I have a question though...

I wonder if it would be better to handle this by, e.g., not returning
empty sheaves to the barn and instead freeing them if the node id doesn't
match and it's not a memoryless node.

init_percpu_sheaves() and bootstrap_cache_sheaves() are not the only
places that can allocate sheaves from remote nodes; sheaf allocation
could fall back to other nodes, and then SLUB could keep reusing those
sheaves from remote nodes even after memory is reclaimed.

If this works well, we probably don't need to special-case
init_percpu_sheaves() and bootstrap_cache_sheaves() at all, since any
remote sheaves they allocate will eventually be freed, and it would cover
the other case too?
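
A minimal userspace sketch of the policy suggested above, to make the
condition concrete. It is an illustration under assumptions, not SLUB
code: struct barn_model, struct sheaf_model, barn_model_put_empty() and
the 3-node topology are all hypothetical names, and "it's not a
memoryless node" is read here as "the barn's own node has memory".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define BARN_MODEL_NR_NODES	3
#define BARN_MODEL_MAX_EMPTY	4

/* Hypothetical topology: node 1 is memoryless. */
static const bool barn_model_node_has_memory[BARN_MODEL_NR_NODES] = {
	true, false, true
};

struct sheaf_model {
	int node;	/* node the sheaf struct's memory came from */
};

struct barn_model {
	struct sheaf_model *empty[BARN_MODEL_MAX_EMPTY];
	int nr_empty;
};

/* Allocate an empty sheaf and record which node it (notionally) lives on. */
struct sheaf_model *sheaf_model_alloc(int node)
{
	struct sheaf_model *sheaf = calloc(1, sizeof(*sheaf));

	if (sheaf)
		sheaf->node = node;
	return sheaf;
}

/*
 * Hand an empty sheaf to the barn of @barn_node.
 * Returns true if the barn kept it, false if it was freed (or was NULL).
 */
bool barn_model_put_empty(struct barn_model *barn, int barn_node,
			  struct sheaf_model *sheaf)
{
	bool remote;

	assert(barn_node >= 0 && barn_node < BARN_MODEL_NR_NODES);
	if (!sheaf)
		return false;

	remote = sheaf->node != barn_node;

	/*
	 * Free remote sheaves so a later allocation can be node-local.
	 * A memoryless node can never get a local sheaf, so keep it there.
	 */
	if ((remote && barn_model_node_has_memory[barn_node]) ||
	    barn->nr_empty >= BARN_MODEL_MAX_EMPTY) {
		free(sheaf);
		return false;
	}

	barn->empty[barn->nr_empty++] = sheaf;
	return true;
}
```

With this model, a sheaf from node 2 handed to node 0's barn is freed,
while a sheaf from node 0 handed to the barn of memoryless node 1 is kept.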

> Signed-off-by: Hao Li <hao.li@linux.dev>
> ---
> Changes in v2:
> - Make init_percpu_sheaves() use a NUMA-aware sheaf struct allocation too.
>   (Thanks Harry)
> - Rebase on latest code.
> 
> v1: https://lore.kernel.org/linux-mm/20260525082312.16012-1-hao.li@linux.dev/

-- 
Cheers,
Harry / Hyeonggon

Re: [PATCH v2] mm/slub: allocate sheaves on local memory nodes

Posted by Hao Li 5 days, 1 hour ago

On Mon, Jun 01, 2026 at 08:28:16PM +0900, Harry Yoo wrote:
> On 6/1/26 6:56 PM, Hao Li wrote:
> > Sheaf structs are exchanged through node-local barns. Since barn structs
> > are already allocated from their local NUMA node, this patch aims to
> > allocate sheaf structs from their local memory nodes as well.
> > 
> > To achieve this, the obvious choice would be using cpu_to_mem().
> > However, init_percpu_sheaves() and bootstrap_cache_sheaves() iterate
> > through possible CPUs, whereas cpu_to_mem() is only initialized for
> > online CPUs. Therefore, we cannot use cpu_to_mem() and instead need to
> > use local_memory_node(cpu_to_node(cpu)), similar to what
> > __build_all_zonelists() does.
> > 
> > The primary goal of this patch is to improve NUMA node locality.
> > Although the actual performance impact is minor, it still yields a ~1%
> > improvement on a 192-core, 8-NUMA-node system when testing with the
> > will-it-scale mmap test case.
> 
> Oh, nice :)
> 
> I have a question though...
> 
> I wonder if it would be better to handle this by, e.g., not returning
> empty sheaves to the barn and instead freeing them if the node id doesn't
> match and it's not a memoryless node.
> 
> init_percpu_sheaves() and bootstrap_cache_sheaves() are not the only
> places that can allocate sheaves from remote nodes; sheaf allocation
> could fall back to other nodes, and then SLUB could keep reusing those
> sheaves from remote nodes even after memory is reclaimed.

This is a good catch. In addition to the fallback mechanism, task migration
between CPUs in __pcs_replace_empty_main() and __pcs_replace_full_main() can
also mix up sheaf structs across different barns. So yeah, changing allocation
locality is not a silver bullet.

> 
> If this works well, we probably don't need to handle it in
> init_percpu_sheaves() and bootstrap_cache_sheaves() at all as they will
> eventually be freed, while covering the other case too?

Freeing the empty sheaf when the NUMA node mismatches, instead of putting it
back into the barn, is indeed a good idea. I like it. Unfortunately, my testing
didn't show a clear performance improvement, though there was no noticeable
degradation either. :-(

I also did some more testing on my patch. Under CONFIG_PREEMPT_LAZY, the
improvement is only about 0.5% (sometimes 1%). And when switching to
CONFIG_PREEMPT, the patch doesn't seem to yield statistically significant
benefits, likely because sheaves get mixed during task migration.

So, perhaps the performance gain just isn't worth the extra complexity. It's a
bit frustrating, but maybe we should just abandon this direction and keep
things as they are... :(

Thanks for the feedback anyway!

-- 
Thanks,
Hao