Kernel test robot has reported a regression in the patch "slab: refill
sheaves from all nodes". When taken in isolation like this, there is
indeed a tradeoff - we prefer to use remote objects prior to allocating
new local slabs. It is replicating a behavior that existed before
sheaves for replenishing cpu (partial) slabs - now called
get_from_any_partial() to allocate a single object.

So the possibility of allocating remote objects is intended even if
remote accesses are then slower. But the profiles in the report also
suggested a contention on the list_lock spinlock. And that's something
we can try to avoid without much tradeoff - if someone else has the
spin_lock, it's more likely they are allocating from the node than
freeing to it, so we can skip it even if it means allocating a new local
slab - contributing to that lock's contention isn't worth it. It should
not result in partial slabs accumulating on the remote node.

Thus add an allow_spin parameter to __refill_objects_node() and
get_partial_node_bulk() to make the attempts from __refill_objects_any()
use only a trylock.

Reported-by: kernel test robot <oliver.sang@intel.com>
Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
To be applied on top of:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-7.0/sheaves
---
 mm/slub.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index eb1f52a79999..ca3db3ae1afb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3378,7 +3378,8 @@ static inline bool pfmemalloc_match(struct slab *slab, gfp_t gfpflags);

 static bool get_partial_node_bulk(struct kmem_cache *s,
 				  struct kmem_cache_node *n,
-				  struct partial_bulk_context *pc)
+				  struct partial_bulk_context *pc,
+				  bool allow_spin)
 {
 	struct slab *slab, *slab2;
 	unsigned int total_free = 0;
@@ -3390,7 +3391,10 @@ static bool get_partial_node_bulk(struct kmem_cache *s,

 	INIT_LIST_HEAD(&pc->slabs);

-	spin_lock_irqsave(&n->list_lock, flags);
+	if (allow_spin)
+		spin_lock_irqsave(&n->list_lock, flags);
+	else if (!spin_trylock_irqsave(&n->list_lock, flags))
+		return false;

 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		struct freelist_counters flc;
@@ -6544,7 +6548,8 @@ EXPORT_SYMBOL(kmem_cache_free_bulk);

 static unsigned int
 __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
-		      unsigned int max, struct kmem_cache_node *n)
+		      unsigned int max, struct kmem_cache_node *n,
+		      bool allow_spin)
 {
 	struct partial_bulk_context pc;
 	struct slab *slab, *slab2;
@@ -6556,7 +6561,7 @@ __refill_objects_node(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int mi
 	pc.min_objects = min;
 	pc.max_objects = max;

-	if (!get_partial_node_bulk(s, n, &pc))
+	if (!get_partial_node_bulk(s, n, &pc, allow_spin))
 		return 0;

 	list_for_each_entry_safe(slab, slab2, &pc.slabs, slab_list) {
@@ -6650,7 +6655,8 @@ __refill_objects_any(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min
 				    n->nr_partial <= s->min_partial)
 				continue;

-			r = __refill_objects_node(s, p, gfp, min, max, n);
+			r = __refill_objects_node(s, p, gfp, min, max, n,
+						  /* allow_spin = */ false);
 			refilled += r;

 			if (r >= min) {
@@ -6691,7 +6697,8 @@ refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 		return 0;

 	refilled = __refill_objects_node(s, p, gfp, min, max,
-					 get_node(s, local_node));
+					 get_node(s, local_node),
+					 /* allow_spin = */ true);
 	if (refilled >= min)
 		return refilled;

---
base-commit: 6f1912181ddfcf851a6670b4fa9c7dfdaf3ed46d
change-id: 20260129-b4-refill_any_trylock-160a31224193
Best regards,
--
Vlastimil Babka <vbabka@suse.cz>
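
For readers following along outside the kernel tree, the locking change can be sketched in userspace C. This is only an illustration under stated assumptions, not the patch itself: struct node_pool, pool_take() and refill() are hypothetical names, a pthread mutex stands in for the list_lock spinlock, and a plain counter stands in for the partial slab list.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-node partial list guarded by list_lock. */
struct node_pool {
	pthread_mutex_t list_lock;
	unsigned int nr_free;	/* objects available on this node */
};

/*
 * Take up to max objects from the pool. With allow_spin == false the
 * caller gives up at once if the lock is held instead of waiting for it.
 */
static unsigned int pool_take(struct node_pool *n, unsigned int max,
			      bool allow_spin)
{
	unsigned int taken;

	if (allow_spin)
		pthread_mutex_lock(&n->list_lock);
	else if (pthread_mutex_trylock(&n->list_lock) != 0)
		return 0;	/* contended: skip this node */

	taken = n->nr_free < max ? n->nr_free : max;
	n->nr_free -= taken;
	pthread_mutex_unlock(&n->list_lock);

	return taken;
}

/* The local node may block; remote nodes are only tried opportunistically. */
static unsigned int refill(struct node_pool *local, struct node_pool *remote,
			   unsigned int nr_remote, unsigned int min,
			   unsigned int max)
{
	unsigned int got;

	assert(min <= max);

	got = pool_take(local, max, true);

	for (unsigned int i = 0; i < nr_remote && got < min; i++)
		got += pool_take(&remote[i], max - got, false);

	return got;	/* a caller would allocate new local memory if got < min */
}
```

In this sketch a remote pool whose lock is currently held contributes nothing and the loop simply moves on, which mirrors the intent described in the changelog: do not add to contention on a remote lock, even at the cost of allocating locally.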
On Thu, Jan 29, 2026 at 10:07:57AM +0100, Vlastimil Babka wrote:
> Kernel test robot has reported a regression in the patch "slab: refill
> sheaves from all nodes". When taken in isolation like this, there is
> indeed a tradeoff - we prefer to use remote objects prior to allocating
> new local slabs. It is replicating a behavior that existed before
> sheaves for replenishing cpu (partial) slabs - now called
> get_from_any_partial() to allocate a single object.
>
> So the possibility of allocating remote objects is intended even if
> remote accesses are then slower. But the profiles in the report also
> suggested a contention on the list_lock spinlock. And that's something
> we can try to avoid without much tradeoff - if someone else has the
> spin_lock, it's more likely they are allocating from the node than
> freeing to it, so we can skip it even if it means allocating a new local
> slab - contributing to that lock's contention isn't worth it. It should
> not result in partial slabs accumulating on the remote node.
>
> Thus add an allow_spin parameter to __refill_objects_node() and
> get_partial_node_bulk() to make the attempts from __refill_objects_any()
> use only a trylock.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

In my testing, this patch improved performance by:

will-it-scale.64.processes      +14.2%
will-it-scale.128.processes      +9.6%
will-it-scale.192.processes     +10.8%
will-it-scale.per_process_ops   +11.6%

Tested-by: Hao Li <hao.li@linux.dev>

--
Thanks
Hao

On Thu, Jan 29, 2026 at 05:21:21PM +0800, Hao Li wrote:
> On Thu, Jan 29, 2026 at 10:07:57AM +0100, Vlastimil Babka wrote:
> > Kernel test robot has reported a regression in the patch "slab: refill
> > sheaves from all nodes". When taken in isolation like this, there is
> > indeed a tradeoff - we prefer to use remote objects prior to allocating
> > new local slabs. It is replicating a behavior that existed before
> > sheaves for replenishing cpu (partial) slabs - now called
> > get_from_any_partial() to allocate a single object.
> >
> > So the possibility of allocating remote objects is intended even if
> > remote accesses are then slower. But the profiles in the report also
> > suggested a contention on the list_lock spinlock. And that's something
> > we can try to avoid without much tradeoff - if someone else has the
> > spin_lock, it's more likely they are allocating from the node than
> > freeing to it, so we can skip it even if it means allocating a new local
> > slab - contributing to that lock's contention isn't worth it. It should
> > not result in partial slabs accumulating on the remote node.
> >
> > Thus add an allow_spin parameter to __refill_objects_node() and
> > get_partial_node_bulk() to make the attempts from __refill_objects_any()
> > use only a trylock.
> >
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com
> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> In my testing, this patch improved performance by:
>
> will-it-scale.64.processes      +14.2%
> will-it-scale.128.processes      +9.6%
> will-it-scale.192.processes     +10.8%
> will-it-scale.per_process_ops   +11.6%
>
> Tested-by: Hao Li <hao.li@linux.dev>

I wonder if using spin_is_contended() or spin_is_locked()
would be better than trylock by avoiding an atomic operation?

--
Cheers,
Harry / Hyeonggon

On 1/29/26 10:30, Harry Yoo wrote:
> On Thu, Jan 29, 2026 at 05:21:21PM +0800, Hao Li wrote:
>> On Thu, Jan 29, 2026 at 10:07:57AM +0100, Vlastimil Babka wrote:
>> > Kernel test robot has reported a regression in the patch "slab: refill
>> > sheaves from all nodes". When taken in isolation like this, there is
>> > indeed a tradeoff - we prefer to use remote objects prior to allocating
>> > new local slabs. It is replicating a behavior that existed before
>> > sheaves for replenishing cpu (partial) slabs - now called
>> > get_from_any_partial() to allocate a single object.
>> >
>> > So the possibility of allocating remote objects is intended even if
>> > remote accesses are then slower. But the profiles in the report also
>> > suggested a contention on the list_lock spinlock. And that's something
>> > we can try to avoid without much tradeoff - if someone else has the
>> > spin_lock, it's more likely they are allocating from the node than
>> > freeing to it, so we can skip it even if it means allocating a new local
>> > slab - contributing to that lock's contention isn't worth it. It should
>> > not result in partial slabs accumulating on the remote node.
>> >
>> > Thus add an allow_spin parameter to __refill_objects_node() and
>> > get_partial_node_bulk() to make the attempts from __refill_objects_any()
>> > use only a trylock.
>> >
>> > Reported-by: kernel test robot <oliver.sang@intel.com>
>> > Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com
>> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> In my testing, this patch improved performance by:
>>
>> will-it-scale.64.processes      +14.2%
>> will-it-scale.128.processes      +9.6%
>> will-it-scale.192.processes     +10.8%
>> will-it-scale.per_process_ops   +11.6%
>>
>> Tested-by: Hao Li <hao.li@linux.dev>
>
> I wonder if using spin_is_contended() or spin_is_locked()
> would be better than trylock by avoiding an atomic operation?

I checked and found that spin_trylock() itself implements a non-atomic check
before the atomic. So adding a spin_is_locked() would only help the caller
bail out a bit faster, but this is not a fastpath. It wouldn't help the
cache coherency traffic, AFAIU.
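
The "check before the atomic" shape described above can be illustrated with C11 atomics. This is a userspace sketch and an assumption about the general pattern, not the kernel's qspinlock code: struct tiny_lock, tiny_trylock() and tiny_unlock() are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal lock word: 0 = unlocked, 1 = locked. */
struct tiny_lock {
	atomic_int val;
};

static bool tiny_trylock(struct tiny_lock *l)
{
	int expected = 0;

	/*
	 * Plain load first: if the lock is already held, return without
	 * attempting a read-modify-write on the shared cache line.
	 */
	if (atomic_load_explicit(&l->val, memory_order_relaxed) != 0)
		return false;

	/* The lock looked free, so only now try the compare-and-swap. */
	return atomic_compare_exchange_strong_explicit(&l->val, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

static void tiny_unlock(struct tiny_lock *l)
{
	atomic_store_explicit(&l->val, 0, memory_order_release);
}

/* Small self-check showing the expected behaviour of the sketch. */
static void tiny_lock_selftest(void)
{
	struct tiny_lock l = { 0 };

	assert(tiny_trylock(&l));	/* free lock is acquired */
	assert(!tiny_trylock(&l));	/* held lock fails on the load alone */
	tiny_unlock(&l);
	assert(tiny_trylock(&l));	/* acquirable again after unlock */
}
```

With this shape, a separate "is it locked?" test in the caller would repeat the same load the trylock already does, so it would save only a function call on the failure path.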

On Thu, Jan 29, 2026 at 11:39:04AM +0100, Vlastimil Babka wrote:
> On 1/29/26 10:30, Harry Yoo wrote:
> > On Thu, Jan 29, 2026 at 05:21:21PM +0800, Hao Li wrote:
> >> On Thu, Jan 29, 2026 at 10:07:57AM +0100, Vlastimil Babka wrote:
> >> > Kernel test robot has reported a regression in the patch "slab: refill
> >> > sheaves from all nodes". When taken in isolation like this, there is
> >> > indeed a tradeoff - we prefer to use remote objects prior to allocating
> >> > new local slabs. It is replicating a behavior that existed before
> >> > sheaves for replenishing cpu (partial) slabs - now called
> >> > get_from_any_partial() to allocate a single object.
> >> >
> >> > So the possibility of allocating remote objects is intended even if
> >> > remote accesses are then slower. But the profiles in the report also
> >> > suggested a contention on the list_lock spinlock. And that's something
> >> > we can try to avoid without much tradeoff - if someone else has the
> >> > spin_lock, it's more likely they are allocating from the node than
> >> > freeing to it, so we can skip it even if it means allocating a new local
> >> > slab - contributing to that lock's contention isn't worth it. It should
> >> > not result in partial slabs accumulating on the remote node.
> >> >
> >> > Thus add an allow_spin parameter to __refill_objects_node() and
> >> > get_partial_node_bulk() to make the attempts from __refill_objects_any()
> >> > use only a trylock.
> >> >
> >> > Reported-by: kernel test robot <oliver.sang@intel.com>
> >> > Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com
> >> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>
> >> In my testing, this patch improved performance by:
> >>
> >> will-it-scale.64.processes      +14.2%
> >> will-it-scale.128.processes      +9.6%
> >> will-it-scale.192.processes     +10.8%
> >> will-it-scale.per_process_ops   +11.6%
> >>
> >> Tested-by: Hao Li <hao.li@linux.dev>
> >
> > I wonder if using spin_is_contended() or spin_is_locked()
> > would be better than trylock by avoiding an atomic operation?
>
> I checked and found that spin_trylock() itself implements a non-atomic check
> before the atomic. So adding a spin_is_locked() would only help the caller
> bail out a bit faster, but this is not a fastpath. It wouldn't help the
> cache coherency traffic, AFAIU.

I looked at qspinlock version of spin_trylock() and you're right :)
I just assumed it will always do a CAS but it's not the case!

--
Cheers,
Harry / Hyeonggon