At this point we have sheaves enabled for all caches, but their refill
is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
slabs - now a redundant caching layer that we are about to remove.
The refill will thus be done from slabs on the node partial list.
Introduce new functions that can do that in an optimized way as it's
easier than modifying the __kmem_cache_alloc_bulk() call chain.
Extend struct partial_context so it can return a list of slabs from the
partial list with the sum of free objects in them within the requested
min and max.
Introduce get_partial_node_bulk() that removes the slabs from the node
partial list and returns them in the list.
Introduce get_freelist_nofreeze() which grabs the freelist without
freezing the slab.
Introduce __refill_objects() that uses the functions above to fill an
array of objects. It has to handle the possibility that the slabs will
contain more objects than were requested, due to concurrent freeing of
objects to those slabs. When no more slabs on partial lists are
available, it will allocate new slabs.
Finally, switch refill_sheaf() to use __refill_objects().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/slub.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 230 insertions(+), 5 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index a84027fbca78..e2b052657d11 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -246,6 +246,9 @@ struct partial_context {
gfp_t flags;
unsigned int orig_size;
void *object;
+ unsigned int min_objects;
+ unsigned int max_objects;
+ struct list_head slabs;
};
static inline bool kmem_cache_debug(struct kmem_cache *s)
@@ -2633,9 +2636,9 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
stat(s, SHEAF_FREE);
}
-static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
- size_t size, void **p);
-
+static unsigned int
+__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max);
static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
gfp_t gfp)
@@ -2646,8 +2649,8 @@ static int refill_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
if (!to_fill)
return 0;
- filled = __kmem_cache_alloc_bulk(s, gfp, to_fill,
- &sheaf->objects[sheaf->size]);
+ filled = __refill_objects(s, &sheaf->objects[sheaf->size], gfp,
+ to_fill, to_fill);
sheaf->size += filled;
@@ -3508,6 +3511,69 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
#endif
static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
+static bool get_partial_node_bulk(struct kmem_cache *s,
+ struct kmem_cache_node *n,
+ struct partial_context *pc)
+{
+ struct slab *slab, *slab2;
+ unsigned int total_free = 0;
+ unsigned long flags;
+
+ /*
+ * Racy check. If we mistakenly see no partial slabs then we
+ * just allocate an empty slab. If we mistakenly try to get a
+ * partial slab and there is none available then get_partial()
+ * will return NULL.
+ */
+ if (!n || !n->nr_partial)
+ return false;
+
+ INIT_LIST_HEAD(&pc->slabs);
+
+ if (gfpflags_allow_spinning(pc->flags))
+ spin_lock_irqsave(&n->list_lock, flags);
+ else if (!spin_trylock_irqsave(&n->list_lock, flags))
+ return false;
+
+ list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
+ struct slab slab_counters;
+ unsigned int slab_free;
+
+ if (!pfmemalloc_match(slab, pc->flags))
+ continue;
+
+ /*
+ * due to atomic updates done by a racing free we should not
+ * read garbage here, but do a sanity check anyway
+ *
+ * slab_free is a lower bound due to subsequent concurrent
+ * freeing, the caller might get more objects than requested and
+ * must deal with it
+ */
+ slab_counters.counters = data_race(READ_ONCE(slab->counters));
+ slab_free = slab_counters.objects - slab_counters.inuse;
+
+ if (unlikely(slab_free > oo_objects(s->oo)))
+ continue;
+
+ /* we already have min and this slab would get us over the max */
+ if (total_free >= pc->min_objects
+ && total_free + slab_free > pc->max_objects)
+ continue;
+
+ remove_partial(n, slab);
+
+ list_add(&slab->slab_list, &pc->slabs);
+
+ total_free += slab_free;
+ if (total_free >= pc->max_objects)
+ break;
+ }
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ return total_free > 0;
+}
+
/*
* Try to allocate a partial slab from a specific node.
*/
@@ -4436,6 +4502,38 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
return freelist;
}
+/*
+ * Get the slab's freelist and do not freeze it.
+ *
+ * Assumes the slab is isolated from node partial list and not frozen.
+ *
+ * Assumes this is performed only for caches without debugging so we
+ * don't need to worry about adding the slab to the full list
+ */
+static inline void *get_freelist_nofreeze(struct kmem_cache *s, struct slab *slab)
+{
+ struct slab new;
+ unsigned long counters;
+ void *freelist;
+
+ do {
+ freelist = slab->freelist;
+ counters = slab->counters;
+
+ new.counters = counters;
+ VM_BUG_ON(new.frozen);
+
+ new.inuse = slab->objects;
+ new.frozen = 0;
+
+ } while (!slab_update_freelist(s, slab,
+ freelist, counters,
+ NULL, new.counters,
+ "get_freelist_nofreeze"));
+
+ return freelist;
+}
+
/*
* Freeze the partial slab and return the pointer to the freelist.
*/
@@ -5373,6 +5471,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
return ret;
}
+static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
+ size_t size, void **p);
+
/*
* returns a sheaf that has at least the requested size
* when prefilling is needed, do so with given gfp flags
@@ -7409,6 +7510,130 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
}
EXPORT_SYMBOL(kmem_cache_free_bulk);
+static unsigned int
+__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
+ unsigned int max)
+{
+ struct slab *slab, *slab2;
+ struct partial_context pc;
+ unsigned int refilled = 0;
+ unsigned long flags;
+ void *object;
+ int node;
+
+ pc.flags = gfp;
+ pc.min_objects = min;
+ pc.max_objects = max;
+
+ node = numa_mem_id();
+
+ /* TODO: consider also other nodes? */
+ if (!get_partial_node_bulk(s, get_node(s, node), &pc))
+ goto new_slab;
+
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ list_del(&slab->slab_list);
+
+ object = get_freelist_nofreeze(s, slab);
+
+ while (object && refilled < max) {
+ p[refilled] = object;
+ object = get_freepointer(s, object);
+ maybe_wipe_obj_freeptr(s, p[refilled]);
+
+ refilled++;
+ }
+
+ /*
+ * Freelist had more objects than we can accommodate, we need to
+ * free them back. We can treat it like a detached freelist, just
+ * need to find the tail object.
+ */
+ if (unlikely(object)) {
+ void *head = object;
+ void *tail;
+ int cnt = 0;
+
+ do {
+ tail = object;
+ cnt++;
+ object = get_freepointer(s, object);
+ } while (object);
+ do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
+ }
+
+ if (refilled >= max)
+ break;
+ }
+
+ if (unlikely(!list_empty(&pc.slabs))) {
+ struct kmem_cache_node *n = get_node(s, node);
+
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial))
+ continue;
+
+ list_del(&slab->slab_list);
+ add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ }
+
+ spin_unlock_irqrestore(&n->list_lock, flags);
+
+ /* any slabs left are completely free and for discard */
+ list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
+
+ list_del(&slab->slab_list);
+ discard_slab(s, slab);
+ }
+ }
+
+
+ if (likely(refilled >= min))
+ goto out;
+
+new_slab:
+
+ slab = new_slab(s, pc.flags, node);
+ if (!slab)
+ goto out;
+
+ stat(s, ALLOC_SLAB);
+ inc_slabs_node(s, slab_nid(slab), slab->objects);
+
+ /*
+ * TODO: possible optimization - if we know we will consume the whole
+ * slab we might skip creating the freelist?
+ */
+ object = slab->freelist;
+ while (object && refilled < max) {
+ p[refilled] = object;
+ object = get_freepointer(s, object);
+ maybe_wipe_obj_freeptr(s, p[refilled]);
+
+ slab->inuse++;
+ refilled++;
+ }
+ slab->freelist = object;
+
+ if (slab->freelist) {
+ struct kmem_cache_node *n = get_node(s, slab_nid(slab));
+
+ spin_lock_irqsave(&n->list_lock, flags);
+ add_partial(n, slab, DEACTIVATE_TO_HEAD);
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+
+ if (refilled < min)
+ goto new_slab;
+out:
+
+ return refilled;
+}
+
static inline
int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
void **p)
--
2.51.1
On Thu, Oct 23, 2025 at 03:52:31PM +0200, Vlastimil Babka wrote:
> At this point we have sheaves enabled for all caches, but their refill
> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> slabs - now a redundant caching layer that we are about to remove.
>
> The refill will thus be done from slabs on the node partial list.
> Introduce new functions that can do that in an optimized way as it's
> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>
> Extend struct partial_context so it can return a list of slabs from the
> partial list with the sum of free objects in them within the requested
> min and max.
>
> Introduce get_partial_node_bulk() that removes the slabs from freelist
> and returns them in the list.
>
> Introduce get_freelist_nofreeze() which grabs the freelist without
> freezing the slab.
>
> Introduce __refill_objects() that uses the functions above to fill an
> array of objects. It has to handle the possibility that the slabs will
> contain more objects that were requested, due to concurrent freeing of
> objects to those slabs. When no more slabs on partial lists are
> available, it will allocate new slabs.
>
> Finally, switch refill_sheaf() to use __refill_objects().
>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
> mm/slub.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 230 insertions(+), 5 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index a84027fbca78..e2b052657d11 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3508,6 +3511,69 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
> #endif
> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>
> +static bool get_partial_node_bulk(struct kmem_cache *s,
> + struct kmem_cache_node *n,
> + struct partial_context *pc)
> +{
> + struct slab *slab, *slab2;
> + unsigned int total_free = 0;
> + unsigned long flags;
> +
> + /*
> + * Racy check. If we mistakenly see no partial slabs then we
> + * just allocate an empty slab. If we mistakenly try to get a
> + * partial slab and there is none available then get_partial()
> + * will return NULL.
> + */
> + if (!n || !n->nr_partial)
> + return false;
> +
> + INIT_LIST_HEAD(&pc->slabs);
> +
> + if (gfpflags_allow_spinning(pc->flags))
> + spin_lock_irqsave(&n->list_lock, flags);
> + else if (!spin_trylock_irqsave(&n->list_lock, flags))
> + return false;
> +
> + list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> + struct slab slab_counters;
> + unsigned int slab_free;
> +
> + if (!pfmemalloc_match(slab, pc->flags))
> + continue;
> +
> + /*
> + * due to atomic updates done by a racing free we should not
> + * read garbage here, but do a sanity check anyway
> + *
> + * slab_free is a lower bound due to subsequent concurrent
> + * freeing, the caller might get more objects than requested and
> + * must deal with it
> + */
> + slab_counters.counters = data_race(READ_ONCE(slab->counters));
> + slab_free = slab_counters.objects - slab_counters.inuse;
> +
> + if (unlikely(slab_free > oo_objects(s->oo)))
> + continue;
> +
> + /* we have already min and this would get us over the max */
> + if (total_free >= pc->min_objects
> + && total_free + slab_free > pc->max_objects)
> + continue;
> +
> + remove_partial(n, slab);
> +
> + list_add(&slab->slab_list, &pc->slabs);
> +
> + total_free += slab_free;
> + if (total_free >= pc->max_objects)
> + break;
It may end up iterating over all slabs in the n->partial list
when the sum of free objects isn't exactly equal to pc->max_objects?
> + }
> +
> + spin_unlock_irqrestore(&n->list_lock, flags);
> + return total_free > 0;
> +}
> +
> /*
> * Try to allocate a partial slab from a specific node.
> */
> @@ -4436,6 +4502,38 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> return freelist;
> }
>
> /*
> * Freeze the partial slab and return the pointer to the freelist.
> */
> @@ -5373,6 +5471,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
> return ret;
> }
>
> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> + size_t size, void **p);
> +
> /*
> * returns a sheaf that has at least the requested size
> * when prefilling is needed, do so with given gfp flags
> @@ -7409,6 +7510,130 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> }
> EXPORT_SYMBOL(kmem_cache_free_bulk);
>
> +static unsigned int
> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> + unsigned int max)
> +{
> + struct slab *slab, *slab2;
> + struct partial_context pc;
> + unsigned int refilled = 0;
> + unsigned long flags;
> + void *object;
> + int node;
> +
> + pc.flags = gfp;
> + pc.min_objects = min;
> + pc.max_objects = max;
> +
> + node = numa_mem_id();
> +
> + /* TODO: consider also other nodes? */
> + if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> + goto new_slab;
> +
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + list_del(&slab->slab_list);
> +
> + object = get_freelist_nofreeze(s, slab);
> +
> + while (object && refilled < max) {
> + p[refilled] = object;
> + object = get_freepointer(s, object);
> + maybe_wipe_obj_freeptr(s, p[refilled]);
> +
> + refilled++;
> + }
> +
> + /*
> + * Freelist had more objects than we can accomodate, we need to
> + * free them back. We can treat it like a detached freelist, just
> + * need to find the tail object.
> + */
> + if (unlikely(object)) {
> + void *head = object;
> + void *tail;
> + int cnt = 0;
> +
> + do {
> + tail = object;
> + cnt++;
> + object = get_freepointer(s, object);
> + } while (object);
> + do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
> + }
Maybe we don't have to do this if we put slabs into a singly linked list
and use the other word to record the number of objects in the slab.
> +
> + if (refilled >= max)
> + break;
> + }
> +
> + if (unlikely(!list_empty(&pc.slabs))) {
> + struct kmem_cache_node *n = get_node(s, node);
> +
> + spin_lock_irqsave(&n->list_lock, flags);
Do we surely know that trylock will succeed when
we succeeded to acquire it in get_partial_node_bulk()?
I think the answer is yes, but just to double check :)
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + if (unlikely(!slab->inuse && n->nr_partial >= s->min_partial))
> + continue;
> +
> + list_del(&slab->slab_list);
> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + }
> +
> + spin_unlock_irqrestore(&n->list_lock, flags);
> +
> + /* any slabs left are completely free and for discard */
> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> +
> + list_del(&slab->slab_list);
> + discard_slab(s, slab);
> + }
> + }
> +
> +
> + if (likely(refilled >= min))
> + goto out;
> +
> +new_slab:
> +
> + slab = new_slab(s, pc.flags, node);
> + if (!slab)
> + goto out;
> +
> + stat(s, ALLOC_SLAB);
> + inc_slabs_node(s, slab_nid(slab), slab->objects);
> +
> + /*
> + * TODO: possible optimization - if we know we will consume the whole
> + * slab we might skip creating the freelist?
> + */
> + object = slab->freelist;
> + while (object && refilled < max) {
> + p[refilled] = object;
> + object = get_freepointer(s, object);
> + maybe_wipe_obj_freeptr(s, p[refilled]);
> +
> + slab->inuse++;
> + refilled++;
> + }
> + slab->freelist = object;
> +
> + if (slab->freelist) {
> + struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> +
> + spin_lock_irqsave(&n->list_lock, flags);
> + add_partial(n, slab, DEACTIVATE_TO_HEAD);
> + spin_unlock_irqrestore(&n->list_lock, flags);
If slab_nid(slab) != node, we should check gfpflags_allow_spinning()
and call defer_deactivate_slab() if it returns false?
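i.e. something roughly like this? (just a sketch, the exact
defer_deactivate_slab() call is assumed)

	if (slab->freelist) {
		struct kmem_cache_node *n = get_node(s, slab_nid(slab));

		if (slab_nid(slab) != node &&
		    !gfpflags_allow_spinning(pc.flags)) {
			/* cannot spin on the remote node's list_lock here */
			defer_deactivate_slab(slab, NULL);
		} else {
			spin_lock_irqsave(&n->list_lock, flags);
			add_partial(n, slab, DEACTIVATE_TO_HEAD);
			spin_unlock_irqrestore(&n->list_lock, flags);
		}
	}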
> + }
> +
> + if (refilled < min)
> + goto new_slab;
> +out:
> +
> + return refilled;
> +}
> +
> static inline
> int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
> void **p)
>
> --
> 2.51.1
>
--
Cheers,
Harry / Hyeonggon
On 10/27/25 08:20, Harry Yoo wrote:
> On Thu, Oct 23, 2025 at 03:52:31PM +0200, Vlastimil Babka wrote:
>> At this point we have sheaves enabled for all caches, but their refill
>> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
>> slabs - now a redundant caching layer that we are about to remove.
>>
>> The refill will thus be done from slabs on the node partial list.
>> Introduce new functions that can do that in an optimized way as it's
>> easier than modifying the __kmem_cache_alloc_bulk() call chain.
>>
>> Extend struct partial_context so it can return a list of slabs from the
>> partial list with the sum of free objects in them within the requested
>> min and max.
>>
>> Introduce get_partial_node_bulk() that removes the slabs from freelist
>> and returns them in the list.
>>
>> Introduce get_freelist_nofreeze() which grabs the freelist without
>> freezing the slab.
>>
>> Introduce __refill_objects() that uses the functions above to fill an
>> array of objects. It has to handle the possibility that the slabs will
>> contain more objects that were requested, due to concurrent freeing of
>> objects to those slabs. When no more slabs on partial lists are
>> available, it will allocate new slabs.
>>
>> Finally, switch refill_sheaf() to use __refill_objects().
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>> mm/slub.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 230 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index a84027fbca78..e2b052657d11 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3508,6 +3511,69 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
>> #endif
>> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
>>
>> +static bool get_partial_node_bulk(struct kmem_cache *s,
>> + struct kmem_cache_node *n,
>> + struct partial_context *pc)
>> +{
>> + struct slab *slab, *slab2;
>> + unsigned int total_free = 0;
>> + unsigned long flags;
>> +
>> + /*
>> + * Racy check. If we mistakenly see no partial slabs then we
>> + * just allocate an empty slab. If we mistakenly try to get a
>> + * partial slab and there is none available then get_partial()
>> + * will return NULL.
>> + */
>> + if (!n || !n->nr_partial)
>> + return false;
>> +
>> + INIT_LIST_HEAD(&pc->slabs);
>> +
>> + if (gfpflags_allow_spinning(pc->flags))
>> + spin_lock_irqsave(&n->list_lock, flags);
>> + else if (!spin_trylock_irqsave(&n->list_lock, flags))
>> + return false;
>> +
>> + list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
>> + struct slab slab_counters;
>> + unsigned int slab_free;
>> +
>> + if (!pfmemalloc_match(slab, pc->flags))
>> + continue;
>> +
>> + /*
>> + * due to atomic updates done by a racing free we should not
>> + * read garbage here, but do a sanity check anyway
>> + *
>> + * slab_free is a lower bound due to subsequent concurrent
>> + * freeing, the caller might get more objects than requested and
>> + * must deal with it
>> + */
>> + slab_counters.counters = data_race(READ_ONCE(slab->counters));
>> + slab_free = slab_counters.objects - slab_counters.inuse;
>> +
>> + if (unlikely(slab_free > oo_objects(s->oo)))
>> + continue;
>> +
>> + /* we have already min and this would get us over the max */
>> + if (total_free >= pc->min_objects
>> + && total_free + slab_free > pc->max_objects)
>> + continue;
Hmm I think I meant to have break; here. Should deal with your concern below?
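I.e. something like this (untested):

		/*
		 * We already have at least min and this slab would get us
		 * over max, so stop scanning instead of walking the rest of
		 * n->partial looking for an exact fit.
		 */
		if (total_free >= pc->min_objects
		    && total_free + slab_free > pc->max_objects)
			break;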
>> + remove_partial(n, slab);
>> +
>> + list_add(&slab->slab_list, &pc->slabs);
>> +
>> + total_free += slab_free;
>> + if (total_free >= pc->max_objects)
>> + break;
>
> It may end up iterating over all slabs in the n->partial list
> when the sum of free objects isn't exactly equal to pc->max_objects?
Good catch, thanks.
>> + }
>> +
>> + spin_unlock_irqrestore(&n->list_lock, flags);
>> + return total_free > 0;
>> +}
>> +
>> /*
>> * Try to allocate a partial slab from a specific node.
>> */
>> @@ -4436,6 +4502,38 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
>> return freelist;
>> }
>>
>> /*
>> * Freeze the partial slab and return the pointer to the freelist.
>> */
>> @@ -5373,6 +5471,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
>> return ret;
>> }
>>
>> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
>> + size_t size, void **p);
>> +
>> /*
>> * returns a sheaf that has at least the requested size
>> * when prefilling is needed, do so with given gfp flags
>> @@ -7409,6 +7510,130 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
>> }
>> EXPORT_SYMBOL(kmem_cache_free_bulk);
>>
>> +static unsigned int
>> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
>> + unsigned int max)
>> +{
>> + struct slab *slab, *slab2;
>> + struct partial_context pc;
>> + unsigned int refilled = 0;
>> + unsigned long flags;
>> + void *object;
>> + int node;
>> +
>> + pc.flags = gfp;
>> + pc.min_objects = min;
>> + pc.max_objects = max;
>> +
>> + node = numa_mem_id();
>> +
>> + /* TODO: consider also other nodes? */
>> + if (!get_partial_node_bulk(s, get_node(s, node), &pc))
>> + goto new_slab;
>> +
>> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
>> +
>> + list_del(&slab->slab_list);
>> +
>> + object = get_freelist_nofreeze(s, slab);
>> +
>> + while (object && refilled < max) {
>> + p[refilled] = object;
>> + object = get_freepointer(s, object);
>> + maybe_wipe_obj_freeptr(s, p[refilled]);
>> +
>> + refilled++;
>> + }
>> +
>> + /*
>> + * Freelist had more objects than we can accomodate, we need to
>> + * free them back. We can treat it like a detached freelist, just
>> + * need to find the tail object.
>> + */
>> + if (unlikely(object)) {
>> + void *head = object;
>> + void *tail;
>> + int cnt = 0;
>> +
>> + do {
>> + tail = object;
>> + cnt++;
>> + object = get_freepointer(s, object);
>> + } while (object);
>> + do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
>> + }
>
> Maybe we don't have to do this if we put slabs into a singly linked list
> and use the other word to record the number of objects in the slab.
You mean we wouldn't have to do the counting? I think it wouldn't help as
the number could become stale after we record it, due to concurrent freeing.
Maybe get_freelist_nofreeze() could return it together with the freelist as
it can get both atomically.
However the main reason for the loop is not to count, but to find the
tail pointer, and I don't see a way around it?
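For reference, the counting part could look something like this (untested
sketch, and it doesn't remove the tail walk):

static inline void *get_freelist_nofreeze(struct kmem_cache *s,
					  struct slab *slab,
					  unsigned int *nr_free)
{
	struct slab new;
	unsigned long counters;
	void *freelist;

	do {
		freelist = slab->freelist;
		counters = slab->counters;

		new.counters = counters;
		VM_BUG_ON(new.frozen);

		/* free objects snapshot, taken from the same cmpxchg'd word */
		*nr_free = new.objects - new.inuse;

		new.inuse = slab->objects;
		new.frozen = 0;

	} while (!slab_update_freelist(s, slab,
				       freelist, counters,
				       NULL, new.counters,
				       "get_freelist_nofreeze"));

	return freelist;
}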
>> +
>> + if (refilled >= max)
>> + break;
>> + }
>> +
>> + if (unlikely(!list_empty(&pc.slabs))) {
>> + struct kmem_cache_node *n = get_node(s, node);
>> +
>> + spin_lock_irqsave(&n->list_lock, flags);
>
> Do we surely know that trylock will succeed when
> we succeeded to acquire it in get_partial_node_bulk()?
>
> I think the answer is yes, but just to double check :)
Yeah as you corrected, answer is no. However I missed that
__pcs_replace_empty_main() will only let us reach here with
gfpflags_allow_blocking() true in the first place. So I didn't have to even
deal with gfpflags_allow_spinning() in get_partial_node_bulk() then. I think
it's the simplest solution.
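So the locking there could become simply (sketch):

	if (!n || !n->nr_partial)
		return false;

	INIT_LIST_HEAD(&pc->slabs);

	/*
	 * refill_sheaf() is only reached with gfpflags_allow_blocking()
	 * so we can always spin here, no trylock fallback needed
	 */
	spin_lock_irqsave(&n->list_lock, flags);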
(side note: gfpflags_allow_blocking() might be too conservative now that
sheafs will be the only caching layer, that condition could be perhaps
changed to gfpflags_allow_spinning() to allow some cheap refill).
On Wed, Oct 29, 2025 at 09:48:27PM +0100, Vlastimil Babka wrote:
> On 10/27/25 08:20, Harry Yoo wrote:
> > On Thu, Oct 23, 2025 at 03:52:31PM +0200, Vlastimil Babka wrote:
> >> At this point we have sheaves enabled for all caches, but their refill
> >> is done via __kmem_cache_alloc_bulk() which relies on cpu (partial)
> >> slabs - now a redundant caching layer that we are about to remove.
> >>
> >> The refill will thus be done from slabs on the node partial list.
> >> Introduce new functions that can do that in an optimized way as it's
> >> easier than modifying the __kmem_cache_alloc_bulk() call chain.
> >>
> >> Extend struct partial_context so it can return a list of slabs from the
> >> partial list with the sum of free objects in them within the requested
> >> min and max.
> >>
> >> Introduce get_partial_node_bulk() that removes the slabs from freelist
> >> and returns them in the list.
> >>
> >> Introduce get_freelist_nofreeze() which grabs the freelist without
> >> freezing the slab.
> >>
> >> Introduce __refill_objects() that uses the functions above to fill an
> >> array of objects. It has to handle the possibility that the slabs will
> >> contain more objects that were requested, due to concurrent freeing of
> >> objects to those slabs. When no more slabs on partial lists are
> >> available, it will allocate new slabs.
> >>
> >> Finally, switch refill_sheaf() to use __refill_objects().
> >>
> >> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >> ---
> >> mm/slub.c | 235 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> >> 1 file changed, 230 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index a84027fbca78..e2b052657d11 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -3508,6 +3511,69 @@ static inline void put_cpu_partial(struct kmem_cache *s, struct slab *slab,
> >> #endif
> >> static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
> >>
> >> +static bool get_partial_node_bulk(struct kmem_cache *s,
> >> + struct kmem_cache_node *n,
> >> + struct partial_context *pc)
> >> +{
> >> + struct slab *slab, *slab2;
> >> + unsigned int total_free = 0;
> >> + unsigned long flags;
> >> +
> >> + /*
> >> + * Racy check. If we mistakenly see no partial slabs then we
> >> + * just allocate an empty slab. If we mistakenly try to get a
> >> + * partial slab and there is none available then get_partial()
> >> + * will return NULL.
> >> + */
> >> + if (!n || !n->nr_partial)
> >> + return false;
> >> +
> >> + INIT_LIST_HEAD(&pc->slabs);
> >> +
> >> + if (gfpflags_allow_spinning(pc->flags))
> >> + spin_lock_irqsave(&n->list_lock, flags);
> >> + else if (!spin_trylock_irqsave(&n->list_lock, flags))
> >> + return false;
> >> +
> >> + list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
> >> + struct slab slab_counters;
> >> + unsigned int slab_free;
> >> +
> >> + if (!pfmemalloc_match(slab, pc->flags))
> >> + continue;
> >> +
> >> + /*
> >> + * due to atomic updates done by a racing free we should not
> >> + * read garbage here, but do a sanity check anyway
> >> + *
> >> + * slab_free is a lower bound due to subsequent concurrent
> >> + * freeing, the caller might get more objects than requested and
> >> + * must deal with it
> >> + */
> >> + slab_counters.counters = data_race(READ_ONCE(slab->counters));
> >> + slab_free = slab_counters.objects - slab_counters.inuse;
> >> +
> >> + if (unlikely(slab_free > oo_objects(s->oo)))
> >> + continue;
> >> +
> >> + /* we have already min and this would get us over the max */
> >> + if (total_free >= pc->min_objects
> >> + && total_free + slab_free > pc->max_objects)
> >> + continue;
>
> Hmm I think I meant to have break; here. Should deal with your concern below?
Yes!
> >> + remove_partial(n, slab);
> >> +
> >> + list_add(&slab->slab_list, &pc->slabs);
> >> +
> >> + total_free += slab_free;
> >> + if (total_free >= pc->max_objects)
> >> + break;
> >
> > It may end up iterating over all slabs in the n->partial list
> > when the sum of free objects isn't exactly equal to pc->max_objects?
>
> Good catch, thanks.
>
> >> + }
> >> +
> >> + spin_unlock_irqrestore(&n->list_lock, flags);
> >> + return total_free > 0;
> >> +}
> >> +
> >> /*
> >> * Try to allocate a partial slab from a specific node.
> >> */
> >> @@ -4436,6 +4502,38 @@ static inline void *get_freelist(struct kmem_cache *s, struct slab *slab)
> >> return freelist;
> >> }
> >>
> >> /*
> >> * Freeze the partial slab and return the pointer to the freelist.
> >> */
> >> @@ -5373,6 +5471,9 @@ static int __prefill_sheaf_pfmemalloc(struct kmem_cache *s,
> >> return ret;
> >> }
> >>
> >> +static int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
> >> + size_t size, void **p);
> >> +
> >> /*
> >> * returns a sheaf that has at least the requested size
> >> * when prefilling is needed, do so with given gfp flags
> >> @@ -7409,6 +7510,130 @@ void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p)
> >> }
> >> EXPORT_SYMBOL(kmem_cache_free_bulk);
> >>
> >> +static unsigned int
> >> +__refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
> >> + unsigned int max)
> >> +{
> >> + struct slab *slab, *slab2;
> >> + struct partial_context pc;
> >> + unsigned int refilled = 0;
> >> + unsigned long flags;
> >> + void *object;
> >> + int node;
> >> +
> >> + pc.flags = gfp;
> >> + pc.min_objects = min;
> >> + pc.max_objects = max;
> >> +
> >> + node = numa_mem_id();
> >> +
> >> + /* TODO: consider also other nodes? */
> >> + if (!get_partial_node_bulk(s, get_node(s, node), &pc))
> >> + goto new_slab;
> >> +
> >> + list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
> >> +
> >> + list_del(&slab->slab_list);
> >> +
> >> + object = get_freelist_nofreeze(s, slab);
> >> +
> >> + while (object && refilled < max) {
> >> + p[refilled] = object;
> >> + object = get_freepointer(s, object);
> >> + maybe_wipe_obj_freeptr(s, p[refilled]);
> >> +
> >> + refilled++;
> >> + }
> >> +
> >> + /*
> >> + * Freelist had more objects than we can accomodate, we need to
> >> + * free them back. We can treat it like a detached freelist, just
> >> + * need to find the tail object.
> >> + */
> >> + if (unlikely(object)) {
> >> + void *head = object;
> >> + void *tail;
> >> + int cnt = 0;
> >> +
> >> + do {
> >> + tail = object;
> >> + cnt++;
> >> + object = get_freepointer(s, object);
> >> + } while (object);
> >> + do_slab_free(s, slab, head, tail, cnt, _RET_IP_);
> >> + }
> >
> > Maybe we don't have to do this if we put slabs into a singly linked list
> > and use the other word to record the number of objects in the slab.
>
> You mean we wouldn't have to do the counting?
Yes.
> I think it wouldn't help as
> the number could become stale after we record it, due to concurrent freeing.
> Maybe get_freelist_nofreeze() could return it together with the freelist as
> it can get both atomically.
>
> However the main reason for the loop is is not to count, but to find the
> tail pointer, and I don't see a way around it?
Uh, right. Nevermind then! I don't see a way around either.
> >> +
> >> + if (refilled >= max)
> >> + break;
> >> + }
> >> +
> >> + if (unlikely(!list_empty(&pc.slabs))) {
> >> + struct kmem_cache_node *n = get_node(s, node);
> >> +
> >> + spin_lock_irqsave(&n->list_lock, flags);
> >
> > Do we surely know that trylock will succeed when
> > we succeeded to acquire it in get_partial_node_bulk()?
> >
> > I think the answer is yes, but just to double check :)
>
> Yeah as you corrected, answer is no. However I missed that
> __pcs_replace_empty_main() will only let us reach here with
> gfpflags_allow_blocking() true in the first place.
Oh right, it's already done before it's called!
As you mentioned, __pcs_replace_empty_main() already knows
gfpflags_allow_blocking() == true when calling refill_sheaf().
And bulk allocation, sheaf prefill/return cannot be called from
kmalloc/kfree_nolock() path.
> So I didn't have to even
> deal with gfpflags_allow_spinning() in get_partial_node_bulk() then. I think
> it's the simplest solution.
Right.
> (side note: gfpflags_allow_blocking() might be too conservative now that
> sheafs will be the only caching layer, that condition could be perhaps
> changed to gfpflags_allow_spinning() to allow some cheap refill).
Sounds good to me.
--
Cheers,
Harry / Hyeonggon
On 10/30/25 01:07, Harry Yoo wrote:
> On Wed, Oct 29, 2025 at 09:48:27PM +0100, Vlastimil Babka wrote:
>> (side note: gfpflags_allow_blocking() might be too conservative now that
>> sheafs will be the only caching layer, that condition could be perhaps
>> changed to gfpflags_allow_spinning() to allow some cheap refill).
>
> Sounds good to me.

Hm now I realized the gfpflags_allow_blocking() check is there to make sure
we can take the local lock without trylock after obtaining a full sheaf, so
we can install it - because it should mean we're not in an interrupt
context. The fact we already succeeded trylock earlier should be enough,
but we'd run again into inventing ugly tricks to make lockdep happy. Or we
use trylock and have failure paths that are only possible to hit on RT in
practice...
On Mon, Oct 27, 2025 at 04:20:56PM +0900, Harry Yoo wrote:
> On Thu, Oct 23, 2025 at 03:52:31PM +0200, Vlastimil Babka wrote:
> > + if (unlikely(!list_empty(&pc.slabs))) {
> > + struct kmem_cache_node *n = get_node(s, node);
> > +
> > + spin_lock_irqsave(&n->list_lock, flags);
>
> Do we surely know that trylock will succeed when
> we succeeded to acquire it in get_partial_node_bulk()?
>
> I think the answer is yes, but just to double check :)
Oh wait, this is not a per-cpu lock, so the answer is no!
We need to check gfpflags_allow_spinning() before spinning then.
--
Cheers,
Harry / Hyeonggon