slub: avoid list_lock contention from __refill_objects_any()

[PATCH] slub: avoid list_lock contention from __refill_objects_any()

Posted by Vlastimil Babka 1 week, 3 days ago

Kernel test robot has reported a regression in the patch "slab: refill
sheaves from all nodes". When taken in isolation like this, there is
indeed a tradeoff - we prefer to use remote objects prior to allocating
new local slabs. It is replicating a behavior that existed before
sheaves for replenishing cpu (partial) slabs - now called
get_from_any_partial() to allocate a single object.

So the possibility of allocating remote objects is intended, even if
remote accesses are then slower. But the profiles in the report also
suggested contention on the list_lock spinlock, and that is something
we can try to avoid without much of a tradeoff. If someone else holds
the spinlock, it's more likely they are allocating from the node than
freeing to it, so we can skip the node even if that means allocating a
new local slab; contributing to that lock's contention isn't worth it.
Skipping should not result in partial slabs accumulating on the remote
node.

Thus add an allow_spin parameter to __refill_objects_node() and
get_partial_node_bulk() to make the attempts from __refill_objects_any()
use only a trylock.

Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
To be applied on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-7.0/sheaves
---
 mm/slub.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index eb1f52a79999..ca3db3ae1afb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3378,7 +3378,8 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);
 
 static bool get_partial_node_bulk(struct kmem_cache *s,
 				  struct kmem_cache_node *n,
-				  struct partial_bulk_context *pc)
+				  struct partial_bulk_context *pc,
+				  bool allow_spin)
 {
 	struct slab *slab, *slab2;
 	unsigned int total_free = 0;
@@ -3390,7 +3391,10 @@ static bool get_partial_node_bulk(struct kmem_cache *s,
 
 	INIT_LIST_HEAD(&pc->slabs);
 
-	spin_lock_irqsave(&n->list_lock, flags);
+	if (allow_spin)
+		spin_lock_irqsave(&n->list_lock, flags);
+	else if (!spin_trylock_irqsave(&n->list_lock, flags))
+		return false;
 
 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		struct freelist_counters flc;
@@ -6544,7 +6548,8 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);
 
 static unsigned int
 __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
-		      unsigned int max, struct kmem_cache_node *n)
+		      unsigned int max, struct kmem_cache_node *n,
+		      bool allow_spin)
 {
 	struct partial_bulk_context pc;
 	struct slab *slab, *slab2;
@@ -6556,7 +6561,7 @@ __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int mi
 	pc.min_objects = min;
 	pc.max_objects = max;
 
-	if (!get_partial_node_bulk(s, n, &pc))
+	if (!get_partial_node_bulk(s, n, &pc, allow_spin))
 		return 0;
 
 	list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
@@ -6650,7 +6655,8 @@ __refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min
 					n->nr_partial <= s->min_partial)
 				continue;
 
-			r = __refill_objects_node(s, p, gfp, min, max, n);
+			r = __refill_objects_node(s, p, gfp, min, max, n,
+						  /* allow_spin = */ false);
 			refilled += r;
 
 			if (r >= min) {
@@ -6691,7 +6697,8 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 		return 0;
 
 	refilled = __refill_objects_node(s, p, gfp, min, max,
-					 get_node(s, local_node));
+					 get_node(s, local_node),
+					 /* allow_spin = */ true);
 	if (refilled >= min)
 		return refilled;
 

---
base-commit: 6f1912181ddfcf851a6670b4fa9c7dfdaf3ed46d
change-id: 20260129-b4-refill_any_trylock-160a31224193

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>
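
The locking change in the patch can be illustrated outside the kernel. The sketch below is a userspace analogue only, assuming a pthread mutex in place of list_lock; struct node_pool and take_objects() are hypothetical names that do not exist in mm/slub.c. It shows the same shape as the new get_partial_node_bulk() entry: block on the lock when allow_spin is true, otherwise try once and report "nothing taken" if the lock is held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for a per-node structure: a lock plus a count
 * of free objects. Not the kernel's struct kmem_cache_node.
 */
struct node_pool {
	pthread_mutex_t lock;
	unsigned int nr_free;
};

/*
 * Take up to max objects from the pool and return how many were taken.
 *
 * allow_spin == true:  wait for the lock (the local-node case).
 * allow_spin == false: try the lock once and return 0 if it is held,
 *                      so the caller can move on to another source
 *                      (the remote-node case).
 */
unsigned int take_objects(struct node_pool *n, unsigned int max,
			  bool allow_spin)
{
	unsigned int taken;

	assert(n != NULL);

	if (allow_spin)
		pthread_mutex_lock(&n->lock);
	else if (pthread_mutex_trylock(&n->lock) != 0)
		return 0;

	taken = n->nr_free < max ? n->nr_free : max;
	n->nr_free -= taken;

	pthread_mutex_unlock(&n->lock);
	return taken;
}
```

A caller that gets 0 back from the non-blocking variant cannot tell an empty pool from a contended one, and in this sketch it does not need to: either way it falls back to its next option, which in the patch is allocating a new local slab.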

Re: [PATCH] slub: avoid list_lock contention from __refill_objects_any()

Posted by Hao Li 1 week, 3 days ago

In my testing, this patch improved performance by:

will-it-scale.64.processes +14.2%
will-it-scale.128.processes +9.6%
will-it-scale.192.processes +10.8%
will-it-scale.per_process_ops +11.6%


Tested-by: Hao Li <hao.li@linux.dev>

-- 
Thanks
Hao


Re: [PATCH] slub: avoid list_lock contention from __refill_objects_any()

Posted by Harry Yoo 1 week, 3 days ago

On Thu, Jan 29, 2026 at 05:21:21PM +0800, Hao Li wrote:
> In my testing, this patch improved performance by:
> 
> will-it-scale.64.processes +14.2%
> will-it-scale.128.processes +9.6%
> will-it-scale.192.processes +10.8%
> will-it-scale.per_process_ops +11.6%
>
> Tested-by: Hao Li <hao.li@linux.dev>

I wonder if using spin_is_contended() or spin_is_locked()
would be better than trylock by avoiding an atomic operation?

-- 
Cheers,
Harry / Hyeonggon

Re: [PATCH] slub: avoid list_lock contention from __refill_objects_any()

Posted by Vlastimil Babka 1 week, 3 days ago

On 1/29/26 10:30, Harry Yoo wrote:
> On Thu, Jan 29, 2026 at 05:21:21PM +0800, Hao Li wrote:
>> In my testing, this patch improved performance by:
>> 
>> will-it-scale.64.processes +14.2%
>> will-it-scale.128.processes +9.6%
>> will-it-scale.192.processes +10.8%
>> will-it-scale.per_process_ops +11.6%
>>
>> Tested-by: Hao Li <hao.li@linux.dev>
> 
> I wonder if using spin_is_contended() or spin_is_locked()
> would be better than trylock by avoiding an atomic operation?

I checked and found that spin_trylock() itself implements a non-atomic check
before the atomic. So adding a spin_is_locked() would only help the caller
bail out a bit faster, but this is not a fastpath. It wouldn't help the
cache coherency traffic, AFAIU.
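
The "check before the atomic" shape can be sketched with C11 atomics. This is a minimal illustration modeled on that description, not the kernel's implementation; struct tiny_lock, tiny_trylock() and tiny_unlock() are hypothetical names. The point is that a held lock is detected with a plain load, so the compare-and-swap (and the exclusive cache-line ownership it requires) is attempted only when the lock looked free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal lock word: 0 = unlocked, 1 = locked. */
struct tiny_lock {
	atomic_int val;
};

/* Try to take the lock once; return true on success, never wait. */
bool tiny_trylock(struct tiny_lock *l)
{
	int expected = 0;

	/*
	 * Plain load first. If the lock is visibly held, return without
	 * issuing any read-modify-write operation.
	 */
	if (atomic_load_explicit(&l->val, memory_order_relaxed) != 0)
		return false;

	/* The lock looked free, so only now attempt the compare-and-swap. */
	return atomic_compare_exchange_strong_explicit(&l->val, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Release the lock; pairs with the acquire in tiny_trylock(). */
void tiny_unlock(struct tiny_lock *l)
{
	atomic_store_explicit(&l->val, 0, memory_order_release);
}
```

With this structure, a separate "is it locked?" test in the caller would repeat the load that the trylock already performs, which matches the reasoning above that it would only save a function call on a path that is not hot.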

Re: [PATCH] slub: avoid list_lock contention from __refill_objects_any()

Posted by Harry Yoo 1 week, 3 days ago

On Thu, Jan 29, 2026 at 11:39:04AM +0100, Vlastimil Babka wrote:
> On 1/29/26 10:30, Harry Yoo wrote:
> > On Thu, Jan 29, 2026 at 05:21:21PM +0800, Hao Li wrote:
> >> In my testing, this patch improved performance by:
> >> 
> >> will-it-scale.64.processes +14.2%
> >> will-it-scale.128.processes +9.6%
> >> will-it-scale.192.processes +10.8%
> >> will-it-scale.per_process_ops +11.6%
> >>
> >> Tested-by: Hao Li <hao.li@linux.dev>
> > 
> > I wonder if using spin_is_contended() or spin_is_locked()
> > would be better than trylock by avoiding an atomic operation?
> 
> I checked and found that spin_trylock() itself implements a non-atomic check
> before the atomic. So adding a spin_is_locked() would only help the caller
> bail out a bit faster, but this is not a fastpath. It wouldn't help the
> cache coherency traffic, AFAIU.

I looked at the qspinlock version of spin_trylock() and you're right :)
I had assumed it would always do a CAS, but that's not the case!

-- 
Cheers,
Harry / Hyeonggon