Overview
========
The slab destructor feature existed in early days of slab allocator(s).
It was removed by the commit c59def9f222d ("Slab allocators: Drop support
for destructors") in 2007 due to lack of serious use cases at that time.
Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
constructor/destructor pair to mitigate the global serialization point
(pcpu_alloc_mutex) that occurs when each slab object allocates and frees
percpu memory during its lifetime.
Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
so each allocate–free cycle requires two expensive acquire/release on
that mutex.
We can mitigate this contention by retaining the percpu regions after
the object is freed and releasing them only when the backing slab pages
are freed.
How to do this with slab constructors and destructors: the constructor
allocates percpu memory, and the destructor frees it when the slab pages
are reclaimed; this slightly alters the constructor’s semantics,
as it can now fail.
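To make that concrete, here is a minimal sketch of what an opt-in ctor/dtor
pair could look like for a cache whose objects embed a percpu pointer. The
struct and its fields are made up for illustration, and the fallible ctor and
the dtor hook follow the shape proposed in this series rather than the
current upstream kmem_cache API:
struct foo {
        int __percpu *stats;    /* survives object free/alloc cycles */
        /* ... fields (re)initialized on every allocation ... */
};
/* Runs once, when a new slab page is populated with objects. */
static int foo_ctor(void *object)
{
        struct foo *f = object;
        f->stats = alloc_percpu(int);   /* takes pcpu_alloc_mutex once */
        if (!f->stats)
                return -ENOMEM;         /* the ctor may now fail (patch #2) */
        return 0;
}
/* Runs once, when the backing slab page is finally freed. */
static void foo_dtor(void *object)
{
        struct foo *f = object;
        /* free_percpu() takes a spinlock rather than pcpu_alloc_mutex,
         * which is what makes calling it from a dtor plausible. */
        free_percpu(f->stats);
}
With such a pair, kmem_cache_alloc()/kmem_cache_free() of a foo no longer
touch the percpu allocator on the hot path; registering the hooks and rolling
back partially constructed slabs when a ctor fails is what patches #1-#3
deal with.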
This series is functional (although not compatible with MM debug
features yet), but still far from perfect. I’m actively refining it and
would appreciate early feedback before I improve it further. :)
This series is based on slab/for-next [2].
Performance Improvement
=======================
I measured the benefit of this series for two different users:
exec() and tc filter insertion/removal.
exec() throughput
-----------------
The performance of exec() is important when short-lived processes are
frequently created. For example: shell-heavy workloads and running many
test cases [3].
I measured exec() throughput with a microbenchmark:
- a 33% exec() throughput gain on a 2-socket machine with 192 CPUs,
- a 4.56% gain on a desktop with 24 hardware threads, and
- even a 4% gain in single-threaded exec() throughput.
Further investigation showed that this was due to the overhead of
acquiring/releasing pcpu_alloc_mutex and its contention.
See patch 7 for more detail on the experiment.
Traffic Filter Insertion and Removal
------------------------------------
Each tc filter allocates three percpu memory regions per tc_action object,
so frequently inserting and removing filters contend heavily on the same
mutex.
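For reference, the three regions are the per-CPU stats pointers embedded in
struct tc_action. A rough sketch of the allocation pattern, simplified from
tcf_idr_create() in net/sched/act_api.c (error unwinding and the rest of the
setup are omitted, and the helper name below is made up), looks like this:
/* Sketch only: the real allocations live in tcf_idr_create(). */
static int tc_action_alloc_pcpu_stats(struct tc_action *p)
{
        p->cpu_bstats = netdev_alloc_pcpu_stats(struct gnet_stats_basic_sync);
        p->cpu_bstats_hw = netdev_alloc_pcpu_stats(struct gnet_stats_basic_sync);
        p->cpu_qstats = alloc_percpu(struct gnet_stats_queue);
        if (!p->cpu_bstats || !p->cpu_bstats_hw || !p->cpu_qstats)
                return -ENOMEM;
        /* Each allocation above serializes on pcpu_alloc_mutex, and the
         * percpu allocator is hit again when the action is destroyed. */
        return 0;
}
The idea of patch #4 is to move these allocations into a ctor so they are set
up once per slab object and reused across action create/destroy cycles.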
In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
more detail), I observed a 26% reduction in system time and much less
contention on pcpu_alloc_mutex with this series.
I saw in old mailing list threads [4] that Mellanox (now NVIDIA) engineers
cared about tc filter insertion rate; these changes may still benefit
workloads they run today.
Feedback Needed from Percpu Allocator Folks
===========================================
As the percpu allocator is directly affected by this series, this work
will need support from the percpu allocator maintainers, and we need to
address their concerns.
Perhaps they will say "this is a percpu memory allocator scalability
issue and we need to make the allocator itself scalable"? I don't know.
What do you say?
Some hanging thoughts:
- Tackling the problem on the slab side is much simpler, because the slab
allocator already caches objects per CPU. Re-creating similar logic
inside the percpu allocator would be redundant.
Also, since this is opt-in per slab cache, other percpu allocator
users remain unaffected.
- If fragmentation is a concern, we could probably allocate larger percpu
chunks and partition them for slab objects.
- If memory overhead becomes an issue, we could introduce a shrinker
to free empty slabs (and thus release the underlying percpu memory chunks).
Patch Sequence
==============
Patch #1 refactors freelist_shuffle() to allow the slab constructor to
fail in the next patch.
Patch #2 allows the slab constructor to fail.
Patch #3 implements the slab destructor feature.
Patch #4 converts net/sched/act_api to use the slab ctor/dtor pair.
Patches #5 and #6 implement APIs to charge and uncharge percpu memory and
percpu counters.
Patch #7 converts mm_struct to use the slab ctor/dtor pair.
Known issues with MM debug features
===================================
The slab destructor feature is not yet compatible with KASAN, KMEMLEAK,
and DEBUG_OBJECTS.
KASAN reports an error when a percpu counter is inserted into the
percpu_counters linked list, because from KASAN's point of view the object
containing the counter has not been allocated yet.
DEBUG_OBJECTS and KMEMLEAK complain when the slab object is freed while
the associated percpu memory is still allocated.
I don't expect fixing these issues to be too difficult, but I need to
think a bit more about how to fix them.
[1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
[2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
[3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
[4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
Harry Yoo (7):
mm/slab: refactor freelist shuffle
treewide, slab: allow slab constructor to return an error
mm/slab: revive the destructor feature in slab allocator
net/sched/act_api: use slab ctor/dtor to reduce contention on pcpu
alloc
mm/percpu: allow (un)charging objects without alloc/free
lib/percpu_counter: allow (un)charging percpu counters without
alloc/free
kernel/fork: improve exec() throughput with slab ctor/dtor pair
arch/powerpc/include/asm/svm.h | 2 +-
arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +-
arch/powerpc/mm/init-common.c | 3 +-
arch/powerpc/platforms/cell/spufs/inode.c | 3 +-
arch/powerpc/platforms/pseries/setup.c | 2 +-
arch/powerpc/platforms/pseries/svm.c | 4 +-
arch/sh/mm/pgtable.c | 3 +-
arch/sparc/mm/tsb.c | 8 +-
block/bdev.c | 3 +-
drivers/dax/super.c | 3 +-
drivers/gpu/drm/i915/i915_request.c | 3 +-
drivers/misc/lkdtm/heap.c | 12 +--
drivers/usb/mon/mon_text.c | 5 +-
fs/9p/v9fs.c | 3 +-
fs/adfs/super.c | 3 +-
fs/affs/super.c | 3 +-
fs/afs/super.c | 5 +-
fs/befs/linuxvfs.c | 3 +-
fs/bfs/inode.c | 3 +-
fs/btrfs/inode.c | 3 +-
fs/ceph/super.c | 3 +-
fs/coda/inode.c | 3 +-
fs/debugfs/inode.c | 3 +-
fs/dlm/lowcomms.c | 3 +-
fs/ecryptfs/main.c | 5 +-
fs/efs/super.c | 3 +-
fs/erofs/super.c | 3 +-
fs/exfat/cache.c | 3 +-
fs/exfat/super.c | 3 +-
fs/ext2/super.c | 3 +-
fs/ext4/super.c | 3 +-
fs/fat/cache.c | 3 +-
fs/fat/inode.c | 3 +-
fs/fuse/inode.c | 3 +-
fs/gfs2/main.c | 9 +-
fs/hfs/super.c | 3 +-
fs/hfsplus/super.c | 3 +-
fs/hpfs/super.c | 3 +-
fs/hugetlbfs/inode.c | 3 +-
fs/inode.c | 3 +-
fs/isofs/inode.c | 3 +-
fs/jffs2/super.c | 3 +-
fs/jfs/super.c | 3 +-
fs/minix/inode.c | 3 +-
fs/nfs/inode.c | 3 +-
fs/nfs/nfs42xattr.c | 3 +-
fs/nilfs2/super.c | 6 +-
fs/ntfs3/super.c | 3 +-
fs/ocfs2/dlmfs/dlmfs.c | 3 +-
fs/ocfs2/super.c | 3 +-
fs/openpromfs/inode.c | 3 +-
fs/orangefs/super.c | 3 +-
fs/overlayfs/super.c | 3 +-
fs/pidfs.c | 3 +-
fs/proc/inode.c | 3 +-
fs/qnx4/inode.c | 3 +-
fs/qnx6/inode.c | 3 +-
fs/romfs/super.c | 3 +-
fs/smb/client/cifsfs.c | 3 +-
fs/squashfs/super.c | 3 +-
fs/tracefs/inode.c | 3 +-
fs/ubifs/super.c | 3 +-
fs/udf/super.c | 3 +-
fs/ufs/super.c | 3 +-
fs/userfaultfd.c | 3 +-
fs/vboxsf/super.c | 3 +-
fs/xfs/xfs_super.c | 3 +-
include/linux/mm_types.h | 40 ++++++---
include/linux/percpu.h | 10 +++
include/linux/percpu_counter.h | 2 +
include/linux/slab.h | 21 +++--
ipc/mqueue.c | 3 +-
kernel/fork.c | 65 +++++++++-----
kernel/rcu/refscale.c | 3 +-
lib/percpu_counter.c | 25 ++++++
lib/radix-tree.c | 3 +-
lib/test_meminit.c | 3 +-
mm/kfence/kfence_test.c | 5 +-
mm/percpu.c | 79 ++++++++++------
mm/rmap.c | 3 +-
mm/shmem.c | 3 +-
mm/slab.h | 11 +--
mm/slab_common.c | 43 +++++----
mm/slub.c | 105 ++++++++++++++++------
net/sched/act_api.c | 82 +++++++++++------
net/socket.c | 3 +-
net/sunrpc/rpc_pipe.c | 3 +-
security/integrity/ima/ima_iint.c | 3 +-
88 files changed, 518 insertions(+), 226 deletions(-)
--
2.43.0
On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
...
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
>
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
When percpu allocator was first introduced, the use cases were a lot more
limited, so the single mutex and expensive alloc/free paths weren't a
problem. We keep using percpu memory for more and more things, and I always
thought we'd eventually need a more sophisticated allocator something with
object caching. I don't exactly know what that should look like but maybe a
simplified version of sl*b serving power of two sizes should do or maybe it
needs to be smaller and more adaptive. We'd need to collect some data to
decide which way to go.
Improving percpu allocator in general is obviously a heavier lift but that
may be a better long-term direction.
Thanks.
--
tejun
On Thu, Apr 24, 2025 at 08:47:05AM -1000, Tejun Heo wrote:
> On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> ...
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> >
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
>
> When percpu allocator was first introduced, the use cases were a lot more
> limited, so the single mutex and expensive alloc/free paths weren't a
> problem. We keep using percpu memory for more and more things, and I always
> thought we'd eventually need a more sophisticated allocator something with
> object caching.
Yeah, when you first write an allocator, it's overkill to make it too
scalable. But over time, as with other allocators, more users show up that
require a more sophisticated allocator.
> I don't exactly know what that should look like but maybe a
> simplified version of sl*b serving power of two sizes should do or maybe it
> needs to be smaller and more adaptive. We'd need to collect some data to
> decide which way to go.
I'm not sure what kind of data we need — maybe allocation size distributions,
or more profiling data on workloads that contend on percpu allocator's locks?
> Improving percpu allocator in general is obviously a heavier lift but that
> may be a better long-term direction.
Yeah, if that's doable. But until then I think it still makes sense to cache
it within slab objects, ...or probably even after improving the percpu
allocator? It's still churn that's incurred during each object's lifetime
regardless. (Need some data to see if it's justifiable)
And, as Mateusz explained, the percpu allocator isn't the only motivation
for the ctor/dtor pair. Other expensive serialization points like pgd_lock
and percpu_counters_lock are further motivations to do this.
--
Cheers,
Harry / Hyeonggon
On Fri, Apr 25, 2025 at 07:10:03PM +0900, Harry Yoo wrote:
> > I don't exactly know what that should look like but maybe a
> > simplified version of sl*b serving power of two sizes should do or maybe it
> > needs to be smaller and more adaptive. We'd need to collect some data to
> > decide which way to go.
>
> I'm not sure what kind of data we need — maybe allocation size distributions,
> or more profiling data on workloads that contend on percpu allocator's locks?
Oh yeah, mostly distributions of memory allocation sizes across different
systems and workloads.
Thanks.
--
tejun
On Thu, 24 Apr 2025, Harry Yoo wrote:
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
Could you keep a cache of recently used per cpu regions so that you can
avoid frequent percpu allocation operation?
You could allocate larger percpu areas for a batch of them and
then assign as needed.
On Thu, Apr 24, 2025 at 5:50 PM Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
>
> On Thu, 24 Apr 2025, Harry Yoo wrote:
>
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
>
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
>
> Could you keep a cache of recently used per cpu regions so that you can
> avoid frequent percpu allocation operation?
>
> You could allocate larger percpu areas for a batch of them and
> then assign as needed.
I was considering a mechanism like that earlier, but the changes
needed to make it happen would result in worse state for the
alloc/free path.
RSS counters are embedded into mm with only the per-cpu areas being a
pointer. The machinery maintains a global list of all of their
instances, i.e. the pointers to internal to mm_struct. That is to say
even if you deserialized allocation of percpu memory itself, you would
still globally serialize on adding/removing the counters to the global
list.
But suppose this got reworked somehow and this bit ceases to be a problem.
Another spot where mm alloc/free globally serializes (at least on
x86_64) is pgd_alloc/free on the global pgd_lock.
Suppose you managed to decompose the lock into a finer granularity, to
the point where it does not pose a problem from contention standpoint.
Even then that's work which does not have to happen there.
General theme is there is a lot of expensive work happening when
dealing with mm lifecycle (*both* from single- and multi-threaded
standpoint) and preferably it would only be dealt with once per
object's existence.
--
Mateusz Guzik <mjguzik gmail.com>
On Thu, 24 Apr 2025, Mateusz Guzik wrote:
> > You could allocate larger percpu areas for a batch of them and
> > then assign as needed.
>
> I was considering a mechanism like that earlier, but the changes
> needed to make it happen would result in worse state for the
> alloc/free path.
>
> RSS counters are embedded into mm with only the per-cpu areas being a
> pointer. The machinery maintains a global list of all of their
> instances, i.e. the pointers to internal to mm_struct. That is to say
> even if you deserialized allocation of percpu memory itself, you would
> still globally serialize on adding/removing the counters to the global
> list.
>
> But suppose this got reworked somehow and this bit ceases to be a problem.
>
> Another spot where mm alloc/free globally serializes (at least on
> x86_64) is pgd_alloc/free on the global pgd_lock.
>
> Suppose you managed to decompose the lock into a finer granularity, to
> the point where it does not pose a problem from contention standpoint.
> Even then that's work which does not have to happen there.
>
> General theme is there is a lot of expensive work happening when
> dealing with mm lifecycle (*both* from single- and multi-threaded
> standpoint) and preferably it would only be dealt with once per
> object's existence.
Maybe change the lifecycle? Allocate a batch nr of entries initially from
the slab allocator and use them for multiple mm_structs as the need
arises.
Do not free them to the slab allocator until you
have too many that do nothing around?
You may also want to avoid counter updates with this scheme if you only
count the batches used. It will become a bit fuzzy but you improve scalability.
On Thu, Apr 24, 2025 at 6:39 PM Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
>
> On Thu, 24 Apr 2025, Mateusz Guzik wrote:
>
> > > You could allocate larger percpu areas for a batch of them and
> > > then assign as needed.
> >
> > I was considering a mechanism like that earlier, but the changes
> > needed to make it happen would result in worse state for the
> > alloc/free path.
> >
> > RSS counters are embedded into mm with only the per-cpu areas being a
> > pointer. The machinery maintains a global list of all of their
> > instances, i.e. the pointers to internal to mm_struct. That is to say
> > even if you deserialized allocation of percpu memory itself, you would
> > still globally serialize on adding/removing the counters to the global
> > list.
> >
> > But suppose this got reworked somehow and this bit ceases to be a problem.
> >
> > Another spot where mm alloc/free globally serializes (at least on
> > x86_64) is pgd_alloc/free on the global pgd_lock.
> >
> > Suppose you managed to decompose the lock into a finer granularity, to
> > the point where it does not pose a problem from contention standpoint.
> > Even then that's work which does not have to happen there.
> >
> > General theme is there is a lot of expensive work happening when
> > dealing with mm lifecycle (*both* from single- and multi-threaded
> > standpoint) and preferably it would only be dealt with once per
> > object's existence.
>
> Maybe change the lifecycle? Allocate a batch nr of entries initially from
> the slab allocator and use them for multiple mm_structs as the need
> arises.
>
> Do not free them to the slab allocator until you
> have too many that do nothing around?
>
> You may also want to avoid counter updates with this scheme if you only
> count the batches used. It will become a bit fuzzy but you improve scalability.
>
If I get this right this proposal boils down to caching all the state,
but hiding the objects from reclaim?
If going this kind of route, perhaps it would be simpler to prevent
direct reclaim on mm objs and instead, if there is memory shortage, let
a different thread take care of them?
--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> Overview
> ========
>
> The slab destructor feature existed in early days of slab allocator(s).
> It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> for destructors") in 2007 due to lack of serious use cases at that time.
>
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
>
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
>
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
>
> How to do this with slab constructors and destructors: the constructor
> allocates percpu memory, and the destructor frees it when the slab pages
> are reclaimed; this slightly alters the constructor’s semantics,
> as it can now fail.
>
I really really really really don't like this. We're opening a pandora's box
of locking issues for slab deadlocks and other subtle issues. IMO the best
solution there would be, what, failing dtors? which says a lot about the whole
situation...
Case in point:
What happens if you allocate a slab and start ->ctor()-ing objects, and then
one of the ctors fails? We need to free the ctor, but not without ->dtor()-ing
everything back (AIUI this is not handled in this series, yet). Besides this
complication, if failing dtors were added into the mix, we'd be left with a
half-initialized slab(!!) in the middle of the cache waiting to get freed,
without being able to.
Then there are obviously other problems like: whatever you're calling must
not ever require the slab allocator (directly or indirectly) and must not
do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
is a no-go (AIUI!) already because of such issues.
Then there's the separate (but adjacent, particularly as we're considering
this series due to performance improvements) issue that the ctor() and
dtor() interfaces are terrible, in the sense that they do not let you batch
in any way shape or form (requiring us to lock/unlock many times, allocate
many times, etc). If this is done for performance improvements, I would prefer
a superior ctor/dtor interface that takes something like a slab iterator and
lets you do these things.
The ghost of 1992 Solaris still haunts us...
> This series is functional (although not compatible with MM debug
> features yet), but still far from perfect. I’m actively refining it and
> would appreciate early feedback before I improve it further. :)
>
> This series is based on slab/for-next [2].
>
> Performance Improvement
> =======================
>
> I measured the benefit of this series for two different users:
> exec() and tc filter insertion/removal.
>
> exec() throughput
> -----------------
>
> The performance of exec() is important when short-lived processes are
> frequently created. For example: shell-heavy workloads and running many
> test cases [3].
>
> I measured exec() throughput with a microbenchmark:
> - 33% of exec() throughput gain on 2-socket machine with 192 CPUs,
> - 4.56% gain on a desktop with 24 hardware threads, and
> - Even 4% gain on a single-threaded exec() throughput.
>
> Further investigation showed that this was due to the overhead of
> acquiring/releasing pcpu_alloc_mutex and its contention.
>
> See patch 7 for more detail on the experiment.
>
> Traffic Filter Insertion and Removal
> ------------------------------------
>
> Each tc filter allocates three percpu memory regions per tc_action object,
> so frequently inserting and removing filters contend heavily on the same
> mutex.
>
> In the Linux-kernel tools/testing tc-filter benchmark (see patch 4 for
> more detail), I observed a 26% reduction in system time and observed
> much less contention on pcpu_alloc_mutex with this series.
>
> I saw in old mailing list threads Mellanox (now NVIDIA) engineers cared
> about tc filter insertion rate; these changes may still benefit
> workloads they run today.
>
The performance improvements are obviously fantastic, but I do wonder
if things could be fixed by just fixing the underlying problems, instead
of tapering over them with slab allocator magic and dubious object lifecycles.
In this case, the big issue is that the pcpu allocator does not scale well.
--
Pedro
On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > Overview
> > ========
> >
> > The slab destructor feature existed in early days of slab allocator(s).
> > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > for destructors") in 2007 due to lack of serious use cases at that time.
> >
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> >
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
> >
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
> >
> > How to do this with slab constructors and destructors: the constructor
> > allocates percpu memory, and the destructor frees it when the slab pages
> > are reclaimed; this slightly alters the constructor’s semantics,
> > as it can now fail.
> >
>
> I really really really really don't like this. We're opening a pandora's box
> of locking issues for slab deadlocks and other subtle issues. IMO the best
> solution there would be, what, failing dtors? which says a lot about the whole
> situation...
>
> Case in point:
<...snip...>
> Then there are obviously other problems like: whatever you're calling must
> not ever require the slab allocator (directly or indirectly) and must not
> do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> is a no-go (AIUI!) already because of such issues.
Could you please elaborate more on this?
--
Cheers,
Harry / Hyeonggon
On Fri, Apr 25, 2025 at 07:12:02PM +0900, Harry Yoo wrote:
> On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> > On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > > Overview
> > > ========
> > >
> > > The slab destructor feature existed in early days of slab allocator(s).
> > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > for destructors") in 2007 due to lack of serious use cases at that time.
> > >
> > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > constructor/destructor pair to mitigate the global serialization point
> > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > percpu memory during its lifetime.
> > >
> > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > so each allocate–free cycle requires two expensive acquire/release on
> > > that mutex.
> > >
> > > We can mitigate this contention by retaining the percpu regions after
> > > the object is freed and releasing them only when the backing slab pages
> > > are freed.
> > >
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> > >
> >
> > I really really really really don't like this. We're opening a pandora's box
> > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > solution there would be, what, failing dtors? which says a lot about the whole
> > situation...
> >
> > Case in point:
>
> <...snip...>
>
> > Then there are obviously other problems like: whatever you're calling must
> > not ever require the slab allocator (directly or indirectly) and must not
> > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > is a no-go (AIUI!) already because of such issues.
>
> Could you please elaborate more on this?
Well, as discussed multiple-times both on-and-off-list, the pcpu allocator is
not a problem here because the freeing path takes a spinlock, not a mutex. But
obviously you can see the fun locking horror dependency chains we're creating
with this patchset. ->ctor() needs to be super careful calling things, avoiding
any sort of loop. ->dtor() needs to be super careful calling things, avoiding
_any_ sort of direct reclaim possibilities. You also now need to pass a gfp_t
to both ->ctor and ->dtor.
With regards to "leaf locks", I still don't really understand what you/Mateusz
mean or how that's even enforceable from the get-go.
So basically:
- ->ctor takes more args, can fail, can do fancier things (multiple allocations,
lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
caches become GFP_ATOMIC-incompatible)
- ->dtor *will* do fancy things like recursing back onto the slab allocator and
grabbing locks
- a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
to dispose of slabs. It can also uncontrollably do $whatever.
- a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
cache_free/slab disposal issues
- a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
issues by purely starting up shrinkers on direct reclaim as well.
- the whole original "Slab object caching allocator" idea from 1992 is extremely
confusing and works super poorly with various debugging features (like, e.g,
KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
TYPESAFE_BY_RCU, that we *really* need.
These are basically my issues with the whole idea. I highly disagree that we should
open this pandora's box for problems in *other places*.
--
Pedro
On Fri, Apr 25, 2025 at 12:42 PM Pedro Falcato <pfalcato@suse.de> wrote:
> With regards to "leaf locks", I still don't really understand what you/Mateusz
> mean or how that's even enforceable from the get-go.
>
> So basically:
> - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
> lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
> caches become GFP_ATOMIC-incompatible)
>
> - ->dtor *will* do fancy things like recursing back onto the slab allocator and
> grabbing locks
>
> - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
> to dispose of slabs. It can also uncontrollably do $whatever.
>
> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
> ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
> cache_free/slab disposal issues
>
> - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
> issues by purely starting up shrinkers on direct reclaim as well.
>
> - the whole original "Slab object caching allocator" idea from 1992 is extremely
> confusing and works super poorly with various debugging features (like, e.g,
> KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
> TYPESAFE_BY_RCU, that we *really* need.
>
> These are basically my issues with the whole idea. I highly disagree that we should
> open this pandora's box for problems in *other places*.
Apologies for the late reply.
Looks like your primary apprehension concerns dtor, I'm going to
address it below.
But first a quick remark that the headline here is expensive
single-threaded work which keeps happening on every mm alloc/free and
which does not have to, extending beyond the percpu allocator
problems. Having a memory allocator which can handle it would be most
welcome.
Now to business:
I'm going to start with pointing out that dtors callable from any
place are not an *inherent* requirement of the idea. Given that
apparently sheaves don't do direct reclaim and that Christoph's idea
does not do it either, I think there is some support for objs with
unsafe dtors *not* being direct reclaimable (instead there can be a
dedicated workqueue or some other mechanism sorting them out). I did
not realize something like this would be considered fine. It is the
easiest way out and is perfectly fine with me.
However, suppose objs with dtors do need to be reclaimable the usual way.
I claim that writing dtors which are safe to use in that context is
not a significant challenge. Moreover, it is also possible to extend
lockdep to validate correct behavior. Finally, test code can trigger
ctor and dtor calls for all slabs to execute all this code at least
once with lockdep enabled. So while *honest* mistakes with locking are
very much possible, they will be trivially caught and I don't believe
the box which is being opened here belongs to Pandora.
So here is another attempt at explaining leaf spinlocks.
Suppose you have a global lock named "crapper". Further suppose the
lock is only taken with _irqsave *and* no locks are taken while
holding it.
Say this is the only consumer:
void crapperbump(void) {
unsigned long flags;
spin_lock_irqsave(&crapper, flags);
mehvar++;
spin_unlock_irqrestore(&crapper, flags);
}
Perhaps you can agree *anyone* can call here at any point and not risk
deadlocking.
That's an example of a leaf lock.
Aight, so how does one combat cases where the code turns into:
spin_lock_irqsave(&crapper, flags);
spin_lock_irqsave(&meh, flags2);
In this case "crapper" is no longer a leaf lock and in principle there
may be lock ordering involving "meh" which does deadlock.
Here is an example way out: when initializing the "crapper" lock, it
gets marked as a leaf lock so that lockdep can check for it. Then on a
lockdep-enabled kernel you get a splat when executing the routine when
you get to locking "meh". This sorts out the ctor side of things.
How does one validate dtor? lockdep can have a "only leaf-locks
allowed in this area" tunable around calls to dtors. Then should a
dtor have ideas of acquiring a lock which is not a leaf lock you are
once more going to get a splat.
And of course you can just force call all ctors and dtors on a debug
kernel (no need to trigger any memory pressure, just walk the list of
slabs with ctor + dtor pairs and call the stuff).
This of course would require some effort to implement, but there is no
rocket science degree required.
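To sketch what such enforcement could look like (none of the annotations
below exist today; spin_lock_mark_leaf() and the lockdep leaf-only window
are made-up names purely to illustrate the proposed extension, and the dtor
hook is the one this series would add):
/* Hypothetical lockdep extension, illustration only. */
static DEFINE_SPINLOCK(crapper);
static void __init crapper_setup(void)
{
        /* declare: no other lock may ever be taken while "crapper" is held */
        spin_lock_mark_leaf(&crapper);
}
/* How slab could wrap dtor invocation on a debug kernel. */
static void slab_run_dtor(struct kmem_cache *s, void *object)
{
        lockdep_leaf_only_begin();      /* non-leaf acquisitions splat here */
        s->dtor(object);
        lockdep_leaf_only_end();
}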
However, given that direct reclaim for mm is apparently not a strict
requirement, my preferred way out is to simply not provide it.
--
Mateusz Guzik <mjguzik gmail.com>
On Wed, Apr 30, 2025 at 09:49:02PM +0200, Mateusz Guzik wrote:
> On Fri, Apr 25, 2025 at 12:42 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > With regards to "leaf locks", I still don't really understand what you/Mateusz
> > mean or how that's even enforceable from the get-go.
> >
> > So basically:
> > - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
> > lock holding, etc, can be hidden with a normal kmem_cache_alloc; certain
> > caches become GFP_ATOMIC-incompatible)
> >
> > - ->dtor *will* do fancy things like recursing back onto the slab allocator and
> > grabbing locks
> >
> > - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
> > to dispose of slabs. It can also uncontrollably do $whatever.
> >
> > - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
> > ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
> > cache_free/slab disposal issues
> >
> > - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
> > issues by purely starting up shrinkers on direct reclaim as well.
> >
> > - the whole original "Slab object caching allocator" idea from 1992 is extremely
> > confusing and works super poorly with various debugging features (like, e.g,
> > KASAN). IMO it should really be reserved (in a limited capacity!) for stuff like
> > TYPESAFE_BY_RCU, that we *really* need.
> >
> > These are basically my issues with the whole idea. I highly disagree that we should
> > open this pandora's box for problems in *other places*.
>
> Now to business:
> I'm going to start with pointing out that dtors callable from any
> place are not an *inherent* requirement of the idea. Given that
> apparently sheaves don't do direct reclaim and that Christoph's idea
> does not do it either, I think there is some support for objs with
> unsafe dtors *not* being direct reclaimable (instead there can be a
> dedicated workqueue or some other mechanism sorting them out). I did
> not realize something like this would be considered fine. It is the
> easiest way out and is perfectly fine with me.
My concern is that this prevents reclaimable slabs from using a
destructor in the future. Probably doesn't matter now, but it
doesn't seem future-proof.
> However, suppose objs with dtors do need to be reclaimable the usual way.
>
> I claim that writing dtors which are safe to use in that context is
> not a significant challenge. Moreover, it is also possible to extend
> lockdep to validate correct behavior. Finally, test code can trigger
> ctor and dtor calls for all slabs to execute all this code at least
> once with lockdep enabled. So while *honest* mistakes with locking are
> very much possible, they will be trivially caught and I don't believe
> the box which is being opened here belongs to Pandora.
>
> So here is another attempt at explaning leaf spinlocks.
>
> Suppose you have a global lock named "crapper". Further suppose the
> lock is only taken with _irqsave *and* no locks are taken while
> holding it.
>
> Say this is the only consumer:
> void crapperbump(void) {
> int flags;
> spin_lock_irqsave(&crapper, flags);
> mehvar++;
> spin_unlock_irqsave(&crapper);
> }
>
> Perhaps you can agree *anyone* can call here at any point and not risk
> deadlocking.
>
> That's an example of a leaf lock.
>
> Aight, so how does one combat cases where the code turns into:
> spin_lock_irqsave(&crapper, flags);
> spin_lock_irqsave(&meh, flags2);
>
> In this case "crapper" is no longer a leaf lock and in principle there
> may be lock ordering involving "meh" which does deadlock.
>
> Here is an example way out: when initializing the "crapper" lock, it
> gets marked as a leaf lock so that lockdep can check for it. Then on a
> lockdep-enabled kernel you get a splat when executing the routine when
> you get to locking "meh". This sorts out the ctor side of things.
>
> How does one validate dtor? lockdep can have a "only leaf-locks
> allowed in this area" tunable around calls to dtors. Then should a
> dtor have ideas of acquiring a lock which is not a leaf lock you are
> once more going to get a splat.
Similar to the concern above, this would limit how users can use the
destructor feature (let's say percpu violated the rule, although it
doesn't appear to, then we would have to modify the percpu allocator to
conform to it).
So I think the most future-proof approach (as Vlastimil suggested
off-list) would be to reclaim slab memory and call dtors in a separate
context (e.g., dedicated workqueues as you mentioned)?
But before doing this I need to check if it's okay to increase the nr of
reclaimed pages by the current task, when we know the memory will be
reclaimed but has not been reclaimed yet.
--
Cheers,
Harry / Hyeonggon
On Fri, Apr 25, 2025 at 11:42:27AM +0100, Pedro Falcato wrote:
> On Fri, Apr 25, 2025 at 07:12:02PM +0900, Harry Yoo wrote:
> > On Thu, Apr 24, 2025 at 12:28:37PM +0100, Pedro Falcato wrote:
> > > On Thu, Apr 24, 2025 at 05:07:48PM +0900, Harry Yoo wrote:
> > > > Overview
> > > > ========
> > > >
> > > > The slab destructor feature existed in early days of slab allocator(s).
> > > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > > for destructors") in 2007 due to lack of serious use cases at that time.
> > > >
> > > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > > constructor/destructor pair to mitigate the global serialization point
> > > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > > percpu memory during its lifetime.
> > > >
> > > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > > so each allocate–free cycle requires two expensive acquire/release on
> > > > that mutex.
> > > >
> > > > We can mitigate this contention by retaining the percpu regions after
> > > > the object is freed and releasing them only when the backing slab pages
> > > > are freed.
> > > >
> > > > How to do this with slab constructors and destructors: the constructor
> > > > allocates percpu memory, and the destructor frees it when the slab pages
> > > > are reclaimed; this slightly alters the constructor’s semantics,
> > > > as it can now fail.
> > > >
> > >
> > > I really really really really don't like this. We're opening a pandora's box
> > > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > > solution there would be, what, failing dtors? which says a lot about the whole
> > > situation...
> > >
> > > Case in point:
> >
> > <...snip...>
> >
> > > Then there are obviously other problems like: whatever you're calling must
> > > not ever require the slab allocator (directly or indirectly) and must not
> > > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > > is a no-go (AIUI!) already because of such issues.
> >
> > Could you please elaborate more on this?
>
> Well, as discussed multiple-times both on-and-off-list, the pcpu allocator is
> not a problem here because the freeing path takes a spinlock, not a mutex.
Yes, and it seems to be a leaf spinlock (no code path in the kernel takes
any other locks while holding the lock).
> But obviously you can see the fun locking horror dependency chains
> we're creating with this patchset.
>
> ->ctor() needs to be super careful calling things, avoiding
> any sort of loop.
You mean recursion to avoid exhausting the kernel stack?
It'd be fine as long as ->ctor does not allocate objects from
the same cache (and of course it should not).
> ->dtor() needs to be super careful calling things, avoiding
> _any_ sort of direct reclaim possibilities.
Why would a dtor _ever_ need to perform direct reclamation?
>You also now need to pass a gfp_t to both ->ctor and ->dtor.
Passing gfp_t to ->ctor agreed, but why to ->dtor?
Destructors should not allocate any memory and thus no need for
'Get Free Page' flags.
> So basically:
> - ->ctor takes more args, can fail, can do fancier things (multiple allocations,
> lock holding, etc, can be hidden with a normal kmem_cache_alloc;
Speaking of deadlocks involving ->ctor, of course, you shouldn't allocate
mm_struct while holding pcpu_lock, or allocate a pgd while holding pgd_lock.
I don't think avoiding deadlocks caused by ->ctor is that difficult, and
lockdep can detect them even if someone makes a mistake.
> - ->dtor *will* do fancy things like recursing back onto the slab allocator and
> grabbing locks
AIUI it can't recurse back onto the slab allocator.
'Taking only leaf spinlocks' is a very restrictive rule.
For example, slab takes list_lock, and it is not a leaf spinlock because
in some path slab can take other locks while holding it. And thus you can't
call kmem_cache_free() directly in ->dtor.
'Doing fancy things and grabbing locks' in dtor is safe as long as
you 1) disable interrupts and 2) take only leaf spinlocks.
> - a normal kmem_cache_free can suddenly attempt to grab !SLUB locks as it tries
> to dispose of slabs. It can also uncontrollably do $whatever.
'Can uncontrollably do $whatever' is not true.
It's safe as long as ->dtor only takes leaf spinlocks.
> - a normal kmem_cache_alloc can call vast swaths of code, uncontrollably, due to
> ->ctor. It can also set off direct reclaim, and thus run into all sorts of kmem_
> cache_free/slab disposal issues
>
> - a normal, possibly-unrelated GFP_KERNEL allocation can also run into all of these
> issues by purely starting up shrinkers on direct reclaim as well.
I don't see that is a problem (as long as ->dtor only takes leaf spinlocks).
--
Cheers,
Harry / Hyeonggon
On Thu, Apr 24, 2025 at 1:28 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > How to do this with slab constructors and destructors: the constructor
> > allocates percpu memory, and the destructor frees it when the slab pages
> > are reclaimed; this slightly alters the constructor’s semantics,
> > as it can now fail.
> >
>
> I really really really really don't like this. We're opening a pandora's box
> of locking issues for slab deadlocks and other subtle issues. IMO the best
> solution there would be, what, failing dtors? which says a lot about the whole
> situation...
>
I noted the need to use leaf spin locks in my IRC conversations with
Harry and later in this very thread, it is a bummer this bit did not
make into the cover letter -- hopefully it would have avoided this
exchange.
I'm going to summarize this again here:
By API contract the dtor can only take a leaf spinlock, in this case one which:
1. disables irqs
2. is the last lock in the dependency chain, as in no locks are taken
while holding it
That way there is no possibility of a deadlock.
This poses a question on how to enforce it and this bit is easy: for
example one can add leaf-spinlock notion to lockdep. Then a misuse on
allocation side is going to complain immediately even without
triggering reclaim. Further, if one would feel so inclined, a test
module can walk the list of all slab caches and do a populate/reclaim
cycle on those with the ctor/dtor pair.
Then there is the matter of particular consumers being ready to do
what they need to on the dtor side only with the spinlock. Does not
sound like a fundamental problem.
> Case in point:
> What happens if you allocate a slab and start ->ctor()-ing objects, and then
> one of the ctors fails? We need to free the ctor, but not without ->dtor()-ing
> everything back (AIUI this is not handled in this series, yet). Besides this
> complication, if failing dtors were added into the mix, we'd be left with a
> half-initialized slab(!!) in the middle of the cache waiting to get freed,
> without being able to.
>
Per my previous paragraph failing dtors would be a self-induced problem.
I can agree one has to roll things back if ctors don't work out, but I
don't think this poses a significant problem.
> Then there are obviously other problems like: whatever you're calling must
> not ever require the slab allocator (directly or indirectly) and must not
> do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> is a no-go (AIUI!) already because of such issues.
>
I don't see how that's true.
> Then there's the separate (but adjacent, particularly as we're considering
> this series due to performance improvements) issue that the ctor() and
> dtor() interfaces are terrible, in the sense that they do not let you batch
> in any way shape or form (requiring us to lock/unlock many times, allocate
> many times, etc). If this is done for performance improvements, I would prefer
> a superior ctor/dtor interface that takes something like a slab iterator and
> lets you do these things.
>
Batching this is also something I mentioned and indeed is a "nice to
have" change. Note however that the work you are suggesting to batch
now also happens on every alloc/free cycle, so doing it once per creation
of a given object instead is already a win.
--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Apr 24, 2025 at 05:20:59PM +0200, Mateusz Guzik wrote:
> On Thu, Apr 24, 2025 at 1:28 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> >
> > I really really really really don't like this. We're opening a pandora's box
> > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > solution there would be, what, failing dtors? which says a lot about the whole
> > situation...
>
> I noted the need to use leaf spin locks in my IRC conversations with
> Harry and later in this very thread, it is a bummer this bit did not
> make into the cover letter -- hopefully it would have avoided this
> exchange.
My bad. Yes, I should have included it in the cover letter.
Will include in the future series.
> I'm going to summarize this again here:
>
> By API contract the dtor can only take a leaf spinlock, in this case one which:
> 1. disables irqs
Alright, as interrupt handlers can also introduce lock dependencies.
> 2. is the last lock in the dependency chain, as in no locks are taken
> while holding it
So if the destructor takes a lock, no users of the lock, in any dependency
chain, can hold any other locks while holding it.
> That way there is no possibility of a deadlock.
>
> This poses a question on how to enforce it and this bit is easy: for
> example one can add leaf-spinlock notion to lockdep. Then a misuse on
> allocation side is going to complain immediately even without
> triggering reclaim. Further, if one would feel so inclined, a test
> module can walk the list of all slab caches and do a populate/reclaim
> cycle on those with the ctor/dtor pair.
>
> Then there is the matter of particular consumers being ready to do
> what they need to on the dtor side only with the spinlock. Does not
> sound like a fundamental problem.
>
> > Case in point:
> > What happens if you allocate a slab and start ->ctor()-ing objects, and then
> > one of the ctors fails? We need to free the ctor, but not without ->dtor()-ing
> > everything back (AIUI this is not handled in this series, yet).
It is handled in __free_slab().
> > Besides this
> > complication, if failing dtors were added into the mix, we'd be left with a
> > half-initialized slab(!!) in the middle of the cache waiting to get freed,
> > without being able to.
>
> Per my previous paragraph failing dtors would be a self-induced problem.
>
> I can agree one has to roll things back if ctors don't work out, but I
> don't think this poses a significant problem.
Agreed. It'd be better to write destructors carefully, avoiding deadlocks,
rather than allowing them to fail.
--
Cheers,
Harry / Hyeonggon
On Thu, Apr 24, 2025 at 5:20 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Thu, Apr 24, 2025 at 1:28 PM Pedro Falcato <pfalcato@suse.de> wrote:
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> > >
> >
> > I really really really really don't like this. We're opening a pandora's box
> > of locking issues for slab deadlocks and other subtle issues. IMO the best
> > solution there would be, what, failing dtors? which says a lot about the whole
> > situation...
> >
>
> I noted the need to use leaf spin locks in my IRC conversations with
> Harry and later in this very thread, it is a bummer this bit did not
> make into the cover letter -- hopefully it would have avoided this
> exchange.
>
> I'm going to summarize this again here:
>
> By API contract the dtor can only take a leaf spinlock, in this case one which:
> 1. disables irqs
> 2. is the last lock in the dependency chain, as in no locks are taken
> while holding it
>
> That way there is no possibility of a deadlock.
>
> This poses a question on how to enforce it and this bit is easy: for
> example one can add leaf-spinlock notion to lockdep. Then a misuse on
> allocation side is going to complain immediately even without
> triggering reclaim. Further, if one would feel so inclined, a test
> module can walk the list of all slab caches and do a populate/reclaim
> cycle on those with the ctor/dtor pair.
>
> Then there is the matter of particular consumers being ready to do
> what they need to on the dtor side only with the spinlock. Does not
> sound like a fundamental problem.
>
> > Case in point:
> > What happens if you allocate a slab and start ->ctor()-ing objects, and then
> > one of the ctors fails? We need to free the ctor, but not without ->dtor()-ing
> > everything back (AIUI this is not handled in this series, yet). Besides this
> > complication, if failing dtors were added into the mix, we'd be left with a
> > half-initialized slab(!!) in the middle of the cache waiting to get freed,
> > without being able to.
> >
>
> Per my previous paragraph failing dtors would be a self-induced problem.
>
> I can agree one has to roll things back if ctors don't work out, but I
> don't think this poses a significant problem.
>
> > Then there are obviously other problems like: whatever you're calling must
> > not ever require the slab allocator (directly or indirectly) and must not
> > do direct reclaim (ever!), at the risk of a deadlock. The pcpu allocator
> > is a no-go (AIUI!) already because of such issues.
> >
>
> I don't see how that's true.
>
> > Then there's the separate (but adjacent, particularly as we're considering
> > this series due to performance improvements) issue that the ctor() and
> > dtor() interfaces are terrible, in the sense that they do not let you batch
> > in any way shape or form (requiring us to lock/unlock many times, allocate
> > many times, etc). If this is done for performance improvements, I would prefer
> > a superior ctor/dtor interface that takes something like a slab iterator and
> > lets you do these things.
> >
>
> Batching this is also something I mentioned and indeed is a "nice to
> have" change. Note however that the work you are suggesting to batch
> now also happens on every alloc/free cycle, so doing it once per creation
> of a given object instead is already a win.
>
Whether the ctor/dtor thing lands or not, I would like to point out the
current state is quite nasty and something(tm) needs to be done.
The mm object is allocated from a per-cpu cache, only to have the
mandatory initialization globally serialize *several* times, including
twice on the same lock. This is so bad that performance would be better
if someone created a globally-locked cache with mms still holding onto
per-cpu memory et al. Or to put it differently, existence of a per-cpu
cache of mm objs is defeated by the global locking endured for each
alloc/free cycle (and this goes beyond percpu memory allocs for counters).
So another idea would be to instead create a cache with *some* granularity
(say 4 or 8 cpu threads per instance). Note this should reduce the total
number of mms allocated (but unused) in the system.
If mms hanging out there would still be populated like in this patchset,
perhaps the reduction in objs which are "wasted" would be sufficient to
ignore direct reclaim? Instead if need be this would be reclaimable from a
dedicated thread (whatever it is in Linux).
--
Mateusz Guzik <mjguzik gmail.com>
On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> Overview
> ========
>
> The slab destructor feature existed in early days of slab allocator(s).
> It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> for destructors") in 2007 due to lack of serious use cases at that time.
>
> Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> constructor/destructor pair to mitigate the global serialization point
> (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> percpu memory during its lifetime.
>
> Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> so each allocate–free cycle requires two expensive acquire/release on
> that mutex.
>
> We can mitigate this contention by retaining the percpu regions after
> the object is freed and releasing them only when the backing slab pages
> are freed.
>
> How to do this with slab constructors and destructors: the constructor
> allocates percpu memory, and the destructor frees it when the slab pages
> are reclaimed; this slightly alters the constructor’s semantics,
> as it can now fail.
>
> This series is functional (although not compatible with MM debug
> features yet), but still far from perfect. I’m actively refining it and
> would appreciate early feedback before I improve it further. :)
>
Thanks for looking into this.

The dtor thing poses a potential problem where a dtor acquiring
arbitrary locks can result in a deadlock during memory reclaim.

So for this to be viable one needs to ensure that in the worst case
this only ever takes leaf-locks (i.e., locks which are last in any
dependency chain -- no locks are being taken if you hold one). This
series needs to demonstrate that the percpu thing qualifies, or
refactor it to that extent.
[...]
> [1] https://lore.kernel.org/linux-mm/CAGudoHFc+Km-3usiy4Wdm1JkM+YjCgD9A8dDKQ06pZP070f1ig@mail.gmail.com
>
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab.git/log/?h=slab/for-next
>
> [3] https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3
>
> [4] https://lore.kernel.org/netdev/vbfmunui7dm.fsf@mellanox.com
[...]
--
Mateusz Guzik <mjguzik gmail.com>
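On "demonstrate that the percpu thing qualifies": the populate/reclaim
test-module idea quoted further up can be sketched without guessing the
series' API. Everything below is hypothetical scaffolding; in
particular the cache is assumed to have been created elsewhere with the
series' ctor/dtor registration call, which is deliberately not spelled
out here:

#include <linux/module.h>
#include <linux/slab.h>

static struct kmem_cache *smoke_cache;	/* assumed: created with the
					 * series' ctor/dtor variant */

static int __init dtor_smoke_init(void)
{
	void *objs[64];
	int i;

	if (!smoke_cache)
		return -ENODEV;

	/* Populate: ctors run here (and may fail under the series). */
	for (i = 0; i < 64; i++) {
		objs[i] = kmem_cache_alloc(smoke_cache, GFP_KERNEL);
		if (!objs[i])
			break;
	}

	/* Free the objects back; any retained percpu regions stay put. */
	while (i--)
		kmem_cache_free(smoke_cache, objs[i]);

	/*
	 * Releasing the now-empty slabs is what would finally invoke
	 * the dtors, without depending on actual memory pressure.
	 */
	kmem_cache_shrink(smoke_cache);
	return 0;
}
module_init(dtor_smoke_init);

MODULE_LICENSE("GPL");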
On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > Overview
> > ========
> >
> > The slab destructor feature existed in early days of slab allocator(s).
> > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > for destructors") in 2007 due to lack of serious use cases at that time.
> >
> > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > constructor/destructor pair to mitigate the global serialization point
> > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > percpu memory during its lifetime.
> >
> > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > so each allocate–free cycle requires two expensive acquire/release on
> > that mutex.
> >
> > We can mitigate this contention by retaining the percpu regions after
> > the object is freed and releasing them only when the backing slab pages
> > are freed.
> >
> > How to do this with slab constructors and destructors: the constructor
> > allocates percpu memory, and the destructor frees it when the slab pages
> > are reclaimed; this slightly alters the constructor’s semantics,
> > as it can now fail.
> >
> > This series is functional (although not compatible with MM debug
> > features yet), but still far from perfect. I’m actively refining it and
> > would appreciate early feedback before I improve it further. :)
> >
>
> Thanks for looking into this.
You're welcome. Thanks for the proposal.
> The dtor thing poses a potential problem where a dtor acquiring
> arbitrary locks can result in a deadlock during memory reclaim.
AFAICT, MM does not reclaim slab memory unless we register a shrinker
to reclaim it. Or am I missing something?
Hmm let's say it does anyway, then is this what you worry about?
someone requests percpu memory
-> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
-> allocates pages from buddy
-> buddy reclaims slab memory
-> slab destructor takes pcpu_alloc_mutex (deadlock!)
> So for this to be viable one needs to ensure that in the worst case
> this only ever takes leaf-locks (i.e., locks which are last in any
> dependency chain -- no locks are being taken if you hold one).
Oh, then you can't allocate memory while holding pcpu_lock or
pcpu_alloc_mutex?
[...]
--
Cheers,
Harry / Hyeonggon
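The cycle Harry sketches above can also be written down as code. A
hypothetical fragment (the mutex and the function are stand-ins, not
actual mm/percpu.c code) showing where the recursion would bite, and
how an allocation done under the mutex could sidestep reclaim if slab
dtors ever do run from reclaim:

#include <linux/gfp.h>
#include <linux/mutex.h>

/* Stand-in for the file-local pcpu_alloc_mutex in mm/percpu.c. */
static DEFINE_MUTEX(pcpu_alloc_mutex_sketch);

/* Hypothetical percpu-side growth path taken with the mutex held. */
static struct page *pcpu_grow_sketch(void)
{
	struct page *page;

	mutex_lock(&pcpu_alloc_mutex_sketch);
	/*
	 * With GFP_KERNEL this allocation may enter direct reclaim; if
	 * reclaim could free slab pages whose dtor needs the same
	 * mutex, the chain above closes into a deadlock.  GFP_NOWAIT
	 * never reclaims, at the price of failing more often.
	 */
	page = alloc_page(GFP_NOWAIT);
	mutex_unlock(&pcpu_alloc_mutex_sketch);

	return page;
}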
On Thu, Apr 24, 2025 at 11:58 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Thu, Apr 24, 2025 at 11:29:13AM +0200, Mateusz Guzik wrote:
> > On Thu, Apr 24, 2025 at 10:08 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > Overview
> > > ========
> > >
> > > The slab destructor feature existed in early days of slab allocator(s).
> > > It was removed by the commit c59def9f222d ("Slab allocators: Drop support
> > > for destructors") in 2007 due to lack of serious use cases at that time.
> > >
> > > Eighteen years later, Mateusz Guzik proposed [1] re-introducing a slab
> > > constructor/destructor pair to mitigate the global serialization point
> > > (pcpu_alloc_mutex) that occurs when each slab object allocates and frees
> > > percpu memory during its lifetime.
> > >
> > > Consider mm_struct: it allocates two percpu regions (mm_cid and rss_stat),
> > > so each allocate–free cycle requires two expensive acquire/release on
> > > that mutex.
> > >
> > > We can mitigate this contention by retaining the percpu regions after
> > > the object is freed and releasing them only when the backing slab pages
> > > are freed.
> > >
> > > How to do this with slab constructors and destructors: the constructor
> > > allocates percpu memory, and the destructor frees it when the slab pages
> > > are reclaimed; this slightly alters the constructor’s semantics,
> > > as it can now fail.
> > >
> > > This series is functional (although not compatible with MM debug
> > > features yet), but still far from perfect. I’m actively refining it and
> > > would appreciate early feedback before I improve it further. :)
> > >
> >
> > Thanks for looking into this.
>
> You're welcome. Thanks for the proposal.
>
> > The dtor thing poses a potential problem where a dtor acquiring
> > arbitrary locks can result in a deadlock during memory reclaim.
>
> AFAICT, MM does not reclaim slab memory unless we register a shrinker
> to reclaim it. Or am I missing something?
>
> Hmm let's say it does anyway, then is this what you worry about?
>
> someone requests percpu memory
> -> percpu allocator takes a lock (e.g., pcpu_alloc_mutex)
> -> allocates pages from buddy
> -> buddy reclaims slab memory
> -> slab destructor takes pcpu_alloc_mutex (deadlock!)
>
> > So for this to be viable one needs to ensure that in the worst case
> > this only ever takes leaf-locks (i.e., locks which are last in any
> > dependency chain -- no locks are being taken if you hold one).
>
> Oh, then you can't allocate memory while holding pcpu_lock or
> pcpu_alloc_mutex?
>
It should be perfectly fine to allocate memory while holding
pcpu_alloc_mutex, as it is not an inherent part of reclaim.

The part used by the dtor would be the spinlock already present in the
percpu allocator.

The idea would be that the mutex-protected area preps everything as
needed, but synchronisation with freeing the pcpu areas would only need
the leaf spinlock.

So no issues there, provided some care is employed.
[...]
--
Mateusz Guzik <mjguzik gmail.com>
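A rough sketch of the split Mateusz describes above (every name here is
made up; this is neither the series' nor mm/percpu.c's real API): the
prepare/charge side keeps the sleeping mutex, while the release a slab
dtor would call only touches a leaf spinlock.

#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/types.h>

static DEFINE_MUTEX(pcpu_prepare_mutex);   /* alloc/charge side, may sleep */
static DEFINE_SPINLOCK(pcpu_leaf_lock);    /* dtor side: leaf, irqs off */

void *expensive_carve_out(size_t size);    /* hypothetical helpers, */
void mark_region_free(void *region);       /* not implemented here  */

/* ctor side: ordinary sleepable context, keeps the heavyweight mutex. */
void *pcpu_region_prepare(size_t size)
{
	void *region;

	mutex_lock(&pcpu_prepare_mutex);
	region = expensive_carve_out(size);
	mutex_unlock(&pcpu_prepare_mutex);

	return region;
}

/*
 * dtor side: only the leaf spinlock -- irqs disabled, no allocation,
 * no sleeping, and no further locks taken while it is held.
 */
void pcpu_region_release(void *region)
{
	unsigned long flags;

	spin_lock_irqsave(&pcpu_leaf_lock, flags);
	mark_region_free(region);
	spin_unlock_irqrestore(&pcpu_leaf_lock, flags);
}

What mark_region_free() would defer, and who later recycles or reclaims
the region back under the mutex, is exactly the refactoring question
raised in the thread.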