The mem_cgroup_alloc() function creates the mem_cgroup struct and its
associated structures, including mem_cgroup_per_node.
Through detailed analysis on our test machine (Arm64, 16GB RAM, 6.6 kernel,
1 NUMA node, memcgv2 with nokmem,nosocket,cgroup_disable=pressure),
we can observe the memory allocation for these structures using the
following shell commands:
# Enable tracing
echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable
echo 1 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace_pipe | grep kmalloc | grep mem_cgroup
# Trigger an allocation if the cgroup subtree does not already enable memcg
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
Ftrace Output:
# mem_cgroup struct allocation
sh-6312 [000] ..... 58015.698365: kmalloc:
call_site=mem_cgroup_css_alloc+0xd8/0x5b4
ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096
gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
# mem_cgroup_per_node allocation
sh-6312 [000] ..... 58015.698389: kmalloc:
call_site=mem_cgroup_css_alloc+0x1d8/0x5b4
ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096
gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false
Key Observations:
1. Both structures use kmalloc with requested sizes between 2KB and 4KB
2. Allocation alignment forces 4KB slab usage because of the predefined
   kmalloc size classes (64B, 128B, ..., 2KB, 4KB, 8KB)
3. Memory waste per memcg instance:
Base struct: 4096 - 2312 = 1784 bytes
Per-node struct: 4096 - 2896 = 1200 bytes
Total waste: 2984 bytes (1-node system)
NUMA scaling: (1200 + 8) * nr_node_ids bytes
So each memcg instance wastes roughly 3KB of memory.
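For example, on a hypothetical 4-node machine the per-node portion of the
waste alone would be (1200 + 8) * 4 = 4832 bytes per memcg instance, on top
of the base-struct waste.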
This patchset introduces dedicated kmem_caches for them:
Patch 2 - mem_cgroup kmem_cache - memcg_cachep
Patch 3 - mem_cgroup_per_node kmem_cache - memcg_pn_cachep
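As a reference only, here is a minimal sketch of how such dedicated caches
could be created (the names memcg_cachep and memcg_pn_cachep come from the
patch list above; the exact creation site, size computation and flags in the
real patches may differ):

static struct kmem_cache *memcg_cachep;
static struct kmem_cache *memcg_pn_cachep;

/*
 * Hedged sketch, not the literal patch.  struct mem_cgroup ends with a
 * per-node pointer array, so the object size must account for nr_node_ids.
 */
static void __init memcg_create_caches(void)
{
	size_t memcg_size = sizeof(struct mem_cgroup) +
			    nr_node_ids * sizeof(struct mem_cgroup_per_node *);

	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
					 SLAB_HWCACHE_ALIGN, NULL);
	memcg_pn_cachep = kmem_cache_create("mem_cgroup_per_node",
					    sizeof(struct mem_cgroup_per_node),
					    0, SLAB_HWCACHE_ALIGN, NULL);
}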
The benefits of this change can be observed with the following tracing
commands:
# Enable tracing
echo 1 > /sys/kernel/tracing/events/kmem/kmem_cache_alloc/enable
echo 1 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace_pipe | grep kmem_cache_alloc | grep mem_cgroup
# In another terminal:
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
The output might now look like this:
# mem_cgroup struct allocation
sh-9827 [000] ..... 289.513598: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806
bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
accounted=false
# mem_cgroup_per_node allocation
sh-9827 [000] ..... 289.513602: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a
bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
accounted=false
This indicates that the `mem_cgroup` struct now requests 2312 bytes
and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
and is allocated 2944 bytes.
The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
`kmem_cache`.
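(The observed sizes imply a 64-byte cache line on this test machine:
roundup(2312, 64) = 2368 and roundup(2896, 64) = 2944.)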
Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
# mem_cgroup struct allocation
sh-9269 [003] ..... 80.396366: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
accounted=false
# mem_cgroup_per_node allocation
sh-9269 [003] ..... 80.396411: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
accounted=false
While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
for performance. Please let me know if there are any issues or if I've
misunderstood anything.
This patchset also moves mem_cgroup_init() ahead of cgroup_init(), because
cgroup_init() allocates root_mem_cgroup while initcalls only run after
cgroup_init(); without the reordering, the kmem_caches would not be ready
yet and we would have to test them for NULL before every use.
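A rough sketch of the resulting init order (the actual hunk in init/main.c
may differ; treat this as an illustration of the dependency, not the patch
itself):

/*
 * mem_cgroup_init() must run first so that the dedicated kmem_caches
 * already exist when cgroup_init() allocates root_mem_cgroup through
 * the memory controller's css_alloc callback.
 */
void __init memcg_before_cgroup_init_sketch(void)
{
	mem_cgroup_init();	/* creates memcg_cachep / memcg_pn_cachep */
	cgroup_init();		/* allocates root_mem_cgroup from them */
}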
ChangeLog:
v2 -> v3:
Move v2's patch 3 to the front of the series; reuse mem_cgroup_init() and
move it ahead of cgroup_init() instead of adding a separate early-init helper.
v1 -> v2:
Patches 1-2: minor commit message changes.
Patch 3: add mem_cgroup_init_early() to help "memcg" prepare its resources
before cgroup_init().
v2: https://lore.kernel.org/all/20250424120937.96164-1-link@vivo.com/
v1: https://lore.kernel.org/all/20250423084306.65706-1-link@vivo.com/
Huan Yang (3):
mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
mm/memcg: use kmem_cache when alloc memcg
mm/memcg: use kmem_cache when alloc memcg pernode info
include/linux/memcontrol.h | 3 +++
init/main.c | 2 ++
mm/memcontrol.c | 21 ++++++++++++++++-----
3 files changed, 21 insertions(+), 5 deletions(-)
base-commit: 2c9c612abeb38aab0e87d48496de6fd6daafb00b
--
2.48.1
On Fri, Apr 25, 2025 at 11:19:22AM +0800, Huan Yang wrote:
> Key Observations:
> 1. Both structures use kmalloc with requested sizes between 2KB and 4KB
> 2. Allocation alignment forces 4KB slab usage because of the predefined
>    kmalloc size classes (64B, 128B, ..., 2KB, 4KB, 8KB)
> 3. Memory waste per memcg instance:
>    Base struct: 4096 - 2312 = 1784 bytes
>    Per-node struct: 4096 - 2896 = 1200 bytes
>    Total waste: 2984 bytes (1-node system)
>    NUMA scaling: (1200 + 8) * nr_node_ids bytes
> So each memcg instance wastes roughly 3KB of memory.

[...]

> This indicates that the `mem_cgroup` struct now requests 2312 bytes
> and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
> and is allocated 2944 bytes.
> The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
> `kmem_cache`.
>
> Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
>
> # mem_cgroup struct allocation
> sh-9269 [003] ..... 80.396366: kmem_cache_alloc:
> call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
> bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
> accounted=false
>
> # mem_cgroup_per_node allocation
> sh-9269 [003] ..... 80.396411: kmem_cache_alloc:
> call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
> bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
> accounted=false
>
> While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
> to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
> for performance. Please let me know if there are any issues or if I've
> misunderstood anything.

This isn't really the right way to think about this.  Memory is ultimately
allocated from the page allocator.  So what you want to know is how many
objects you get per page.  Before, it's one per page (since both objects
are between 2k and 4k and rounded up to 4k).  After, slab will create
slabs of a certain order to minimise waste, but also not inflate the
allocation order too high.  Let's assume it goes all the way to order 3
(like kmalloc-4k does), so you want to know how many objects fit in a
32KiB allocation.

With HWCACHE_ALIGN, you get floor(32768/2368) = 13 and
floor(32768/2944) = 11.

Without HWCACHE_ALIGN, you get floor(32768/2312) = 14 and
floor(32768/2896) = 11.

So there is a packing advantage to turning off HWCACHE_ALIGN (for the
first slab; no difference for the second).  BUT!  Now you have cacheline
aliasing between two objects, and that's probably bad.  It's the kind
of performance problem that's really hard to see.

Anyway, you've gone from allocating 8 objects per 32KiB to allocating
13 objects per 32KiB, a 62% improvement in memory consumption.
Hi Matthew,

On 2025/4/25 12:35, Matthew Wilcox wrote:
> On Fri, Apr 25, 2025 at 11:19:22AM +0800, Huan Yang wrote:
>> Key Observations:
>> 1. Both structures use kmalloc with requested sizes between 2KB and 4KB
>> 2. Allocation alignment forces 4KB slab usage because of the predefined
>>    kmalloc size classes (64B, 128B, ..., 2KB, 4KB, 8KB)
>> 3. Memory waste per memcg instance:
>>    Base struct: 4096 - 2312 = 1784 bytes
>>    Per-node struct: 4096 - 2896 = 1200 bytes
>>    Total waste: 2984 bytes (1-node system)
>>    NUMA scaling: (1200 + 8) * nr_node_ids bytes
>> So each memcg instance wastes roughly 3KB of memory.
> [...]
>
>> This indicates that the `mem_cgroup` struct now requests 2312 bytes
>> and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
>> and is allocated 2944 bytes.
>> The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
>> `kmem_cache`.
>>
>> Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
>>
>> # mem_cgroup struct allocation
>> sh-9269 [003] ..... 80.396366: kmem_cache_alloc:
>> call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
>> bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
>> accounted=false
>>
>> # mem_cgroup_per_node allocation
>> sh-9269 [003] ..... 80.396411: kmem_cache_alloc:
>> call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
>> bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
>> accounted=false
>>
>> While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
>> to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
>> for performance. Please let me know if there are any issues or if I've
>> misunderstood anything.
> This isn't really the right way to think about this.  Memory is ultimately
> allocated from the page allocator.  So what you want to know is how many
> objects you get per page.  Before, it's one per page (since both objects
> are between 2k and 4k and rounded up to 4k).  After, slab will create
> slabs of a certain order to minimise waste, but also not inflate the
> allocation order too high.  Let's assume it goes all the way to order 3
> (like kmalloc-4k does), so you want to know how many objects fit in a
> 32KiB allocation.
>
> With HWCACHE_ALIGN, you get floor(32768/2368) = 13 and
> floor(32768/2944) = 11.
>
> Without HWCACHE_ALIGN, you get floor(32768/2312) = 14 and
> floor(32768/2896) = 11.

Yes, thanks. And this is easy to observe with the following commands:

# show the mem_cgroup slab's order, it's 3
cat /sys/kernel/slab/mem_cgroup/order
# show the mem_cgroup slab's objects per slab, it's 13
cat /sys/kernel/slab/mem_cgroup/objs_per_slab

We can also quickly calculate, for each slab page order, how many objects
it can store compared with the original kmalloc-4k allocation (the ORIGIN
column):

# mem_cgroup, 2368 bytes per object
| ORDER | SIZE | NUM_OBJS | ORIGIN |
| ----- | ---- | -------- | ------ |
|   3   | 32KB |    13    |    8   |
|   2   | 16KB |     6    |    4   |
|   1   |  8KB |     3    |    2   |
|   0   |  4KB |     1    |    1   |

# mem_cgroup_per_node, 2944 bytes per object
| ORDER | SIZE | NUM_OBJS | ORIGIN |
| ----- | ---- | -------- | ------ |
|   3   | 32KB |    11    |    8   |
|   2   | 16KB |     5    |    4   |
|   1   |  8KB |     2    |    2   |
|   0   |  4KB |     1    |    1   |

So for mem_cgroup, any page order >= 1 already gives an improvement, while
mem_cgroup_per_node needs at least order 2. :)

> So there is a packing advantage to turning off HWCACHE_ALIGN (for the
> first slab; no difference for the second).  BUT!  Now you have cacheline
> aliasing between two objects, and that's probably bad.  It's the kind
> of performance problem that's really hard to see.

Yes. And I would like to ask: in what situations do you think dropping
HWCACHE_ALIGN might cause issues? Could it be direct memory reclaim by
multiple processes? Or multiple processes charging memory simultaneously?

> Anyway, you've gone from allocating 8 objects per 32KiB to allocating
> 13 objects per 32KiB, a 62% improvement in memory consumption.

Thanks, that's much clearer.

Huan