The mem_cgroup_alloc() function creates the mem_cgroup struct and its
associated structures, including mem_cgroup_per_node.
Through detailed analysis on our test machine (Arm64, 16GB RAM, 6.6 kernel,
1 NUMA node, memcgv2 with nokmem,nosocket,cgroup_disable=pressure),
we can observe the memory allocation for these structures using the
following shell commands:
# Enable tracing
echo 1 > /sys/kernel/tracing/events/kmem/kmalloc/enable
echo 1 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace_pipe | grep kmalloc | grep mem_cgroup
# Trigger an allocation if the cgroup subtree does not already enable memcg
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
Ftrace Output:
# mem_cgroup struct allocation
sh-6312 [000] ..... 58015.698365: kmalloc:
call_site=mem_cgroup_css_alloc+0xd8/0x5b4
ptr=000000003e4c3799 bytes_req=2312 bytes_alloc=4096
gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1 accounted=false
# mem_cgroup_per_node allocation
sh-6312 [000] ..... 58015.698389: kmalloc:
call_site=mem_cgroup_css_alloc+0x1d8/0x5b4
ptr=00000000d798700c bytes_req=2896 bytes_alloc=4096
gfp_flags=GFP_KERNEL|__GFP_ZERO node=0 accounted=false
Key Observations:
1. Both structures use kmalloc with requested sizes between 2KB and 4KB
2. Allocation alignment forces 4KB slab usage because of the predefined
   kmalloc size classes (64B, 128B, ..., 2KB, 4KB, 8KB)
3. Memory waste per memcg instance:
Base struct: 4096 - 2312 = 1784 bytes
Per-node struct: 4096 - 2896 = 1200 bytes
Total waste: 2984 bytes (1-node system)
NUMA scaling: (1200 + 8) * nr_node_ids bytes
So each memcg instance wastes roughly 3KB of memory.
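For example, on a hypothetical 4-node machine the per-node portion of the
waste alone would be (1200 + 8) * 4 = 4832 bytes per memcg instance, on top
of the base-struct waste.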
This patchset introduces dedicated kmem_caches for them:
Patch 2 - mem_cgroup kmem_cache - memcg_cachep
Patch 3 - mem_cgroup_per_node kmem_cache - memcg_pn_cachep
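As a reference only, here is a minimal sketch of how such dedicated caches
could be created (the names memcg_cachep and memcg_pn_cachep come from the
patch list above; the exact creation site, size computation and flags in the
real patches may differ):

static struct kmem_cache *memcg_cachep;
static struct kmem_cache *memcg_pn_cachep;

/*
 * Hedged sketch, not the literal patch.  struct mem_cgroup ends with a
 * per-node pointer array, so the object size must account for nr_node_ids.
 */
static void __init memcg_create_caches(void)
{
	size_t memcg_size = sizeof(struct mem_cgroup) +
			    nr_node_ids * sizeof(struct mem_cgroup_per_node *);

	memcg_cachep = kmem_cache_create("mem_cgroup", memcg_size, 0,
					 SLAB_HWCACHE_ALIGN, NULL);
	memcg_pn_cachep = kmem_cache_create("mem_cgroup_per_node",
					    sizeof(struct mem_cgroup_per_node),
					    0, SLAB_HWCACHE_ALIGN, NULL);
}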
The benefits of this change can be observed with the following tracing
commands:
# Enable tracing
echo 1 > /sys/kernel/tracing/events/kmem/kmem_cache_alloc/enable
echo 1 > /sys/kernel/tracing/tracing_on
cat /sys/kernel/tracing/trace_pipe | grep kmem_cache_alloc | grep mem_cgroup
# In another terminal:
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
The output might now look like this:
# mem_cgroup struct allocation
sh-9827 [000] ..... 289.513598: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=00000000695c1806
bytes_req=2312 bytes_alloc=2368 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
accounted=false
# mem_cgroup_per_node allocation
sh-9827 [000] ..... 289.513602: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=000000002989e63a
bytes_req=2896 bytes_alloc=2944 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
accounted=false
This indicates that the `mem_cgroup` struct now requests 2312 bytes
and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
and is allocated 2944 bytes.
The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
`kmem_cache`.
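(The observed sizes imply a 64-byte cache line on this test machine:
roundup(2312, 64) = 2368 and roundup(2896, 64) = 2944.)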
Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
# mem_cgroup struct allocation
sh-9269 [003] ..... 80.396366: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
accounted=false
# mem_cgroup_per_node allocation
sh-9269 [003] ..... 80.396411: kmem_cache_alloc:
call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
accounted=false
While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
for performance. Please let me know if there are any issues or if I've
misunderstood anything.
This patchset also moves mem_cgroup_init() ahead of cgroup_init(), because
cgroup_init() allocates root_mem_cgroup while initcalls only run after
cgroup_init(); without the reordering, the kmem_caches would not be ready
yet and we would have to test them for NULL before every use.
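A rough sketch of the resulting init order (the actual hunk in init/main.c
may differ; treat this as an illustration of the dependency, not the patch
itself):

/*
 * mem_cgroup_init() must run first so that the dedicated kmem_caches
 * already exist when cgroup_init() allocates root_mem_cgroup through
 * the memory controller's css_alloc callback.
 */
void __init memcg_before_cgroup_init_sketch(void)
{
	mem_cgroup_init();	/* creates memcg_cachep / memcg_pn_cachep */
	cgroup_init();		/* allocates root_mem_cgroup from them */
}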
ChangeLog:
v2 -> v3:
Move v2's patch 3 to the front of the series; reuse mem_cgroup_init() and
move it ahead of cgroup_init() instead of adding a separate early-init helper.
v1 -> v2:
Patches 1-2: minor commit message changes.
Patch 3: add mem_cgroup_init_early() to help "memcg" prepare its resources
before cgroup_init().
v2: https://lore.kernel.org/all/20250424120937.96164-1-link@vivo.com/
v1: https://lore.kernel.org/all/20250423084306.65706-1-link@vivo.com/
Huan Yang (3):
mm/memcg: move mem_cgroup_init() ahead of cgroup_init()
mm/memcg: use kmem_cache when alloc memcg
mm/memcg: use kmem_cache when alloc memcg pernode info
include/linux/memcontrol.h | 3 +++
init/main.c | 2 ++
mm/memcontrol.c | 21 ++++++++++++++++-----
3 files changed, 21 insertions(+), 5 deletions(-)
base-commit: 2c9c612abeb38aab0e87d48496de6fd6daafb00b
--
2.48.1
On Fri, Apr 25, 2025 at 11:19:22AM +0800, Huan Yang wrote:
> Key Observations:
> 1. Both structures use kmalloc with requested sizes between 2KB and 4KB
> 2. Allocation alignment forces 4KB slab usage because of the predefined
>    kmalloc size classes (64B, 128B, ..., 2KB, 4KB, 8KB)
> 3. Memory waste per memcg instance:
>    Base struct: 4096 - 2312 = 1784 bytes
>    Per-node struct: 4096 - 2896 = 1200 bytes
>    Total waste: 2984 bytes (1-node system)
>    NUMA scaling: (1200 + 8) * nr_node_ids bytes
> So each memcg instance wastes roughly 3KB of memory.

[...]

> This indicates that the `mem_cgroup` struct now requests 2312 bytes
> and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
> and is allocated 2944 bytes.
> The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
> `kmem_cache`.
>
> Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
>
> # mem_cgroup struct allocation
> sh-9269 [003] ..... 80.396366: kmem_cache_alloc:
> call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
> bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
> accounted=false
>
> # mem_cgroup_per_node allocation
> sh-9269 [003] ..... 80.396411: kmem_cache_alloc:
> call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
> bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
> accounted=false
>
> While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
> to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
> for performance. Please let me know if there are any issues or if I've
> misunderstood anything.

This isn't really the right way to think about this.  Memory is ultimately
allocated from the page allocator.  So what you want to know is how many
objects you get per page.  Before, it's one per page (since both objects
are between 2k and 4k and rounded up to 4k).  After, slab will create
slabs of a certain order to minimise waste, but also not inflate the
allocation order too high.  Let's assume it goes all the way to order 3
(like kmalloc-4k does), so you want to know how many objects fit in a
32KiB allocation.

With HWCACHE_ALIGN, you get floor(32768/2368) = 13 and
floor(32768/2944) = 11.

Without HWCACHE_ALIGN, you get floor(32768/2312) = 14 and
floor(32768/2896) = 11.

So there is a packing advantage to turning off HWCACHE_ALIGN (for the
first slab; no difference for the second).  BUT!  Now you have cacheline
aliasing between two objects, and that's probably bad.  It's the kind
of performance problem that's really hard to see.

Anyway, you've gone from allocating 8 objects per 32KiB to allocating
13 objects per 32KiB, a 62% improvement in memory consumption.
Hi Matthew,

On 2025/4/25 12:35, Matthew Wilcox wrote:
> On Fri, Apr 25, 2025 at 11:19:22AM +0800, Huan Yang wrote:
>> Key Observations:
>> 1. Both structures use kmalloc with requested sizes between 2KB and 4KB
>> 2. Allocation alignment forces 4KB slab usage because of the predefined
>>    kmalloc size classes (64B, 128B, ..., 2KB, 4KB, 8KB)
>> 3. Memory waste per memcg instance:
>>    Base struct: 4096 - 2312 = 1784 bytes
>>    Per-node struct: 4096 - 2896 = 1200 bytes
>>    Total waste: 2984 bytes (1-node system)
>>    NUMA scaling: (1200 + 8) * nr_node_ids bytes
>> So each memcg instance wastes roughly 3KB of memory.
> [...]
>
>> This indicates that the `mem_cgroup` struct now requests 2312 bytes
>> and is allocated 2368 bytes, while `mem_cgroup_per_node` requests 2896 bytes
>> and is allocated 2944 bytes.
>> The slight increase in allocated size is due to `SLAB_HWCACHE_ALIGN` in the
>> `kmem_cache`.
>>
>> Without `SLAB_HWCACHE_ALIGN`, the allocation might appear as:
>>
>> # mem_cgroup struct allocation
>> sh-9269 [003] ..... 80.396366: kmem_cache_alloc:
>> call_site=mem_cgroup_css_alloc+0xbc/0x5d4 ptr=000000005b12b475
>> bytes_req=2312 bytes_alloc=2312 gfp_flags=GFP_KERNEL|__GFP_ZERO node=-1
>> accounted=false
>>
>> # mem_cgroup_per_node allocation
>> sh-9269 [003] ..... 80.396411: kmem_cache_alloc:
>> call_site=mem_cgroup_css_alloc+0x1b8/0x5d4 ptr=00000000f347adc6
>> bytes_req=2896 bytes_alloc=2896 gfp_flags=GFP_KERNEL|__GFP_ZERO node=0
>> accounted=false
>>
>> While the `bytes_alloc` now matches the `bytes_req`, this patchset defaults
>> to using `SLAB_HWCACHE_ALIGN` as it is generally considered more beneficial
>> for performance. Please let me know if there are any issues or if I've
>> misunderstood anything.
> This isn't really the right way to think about this.  Memory is ultimately
> allocated from the page allocator.  So what you want to know is how many
> objects you get per page.  Before, it's one per page (since both objects
> are between 2k and 4k and rounded up to 4k).  After, slab will create
> slabs of a certain order to minimise waste, but also not inflate the
> allocation order too high.  Let's assume it goes all the way to order 3
> (like kmalloc-4k does), so you want to know how many objects fit in a
> 32KiB allocation.
>
> With HWCACHE_ALIGN, you get floor(32768/2368) = 13 and
> floor(32768/2944) = 11.
>
> Without HWCACHE_ALIGN, you get floor(32768/2312) = 14 and
> floor(32768/2896) = 11.

Yes, thanks. And this is easy to observe with the following commands:

# show the mem_cgroup slab's order, it's 3
cat /sys/kernel/slab/mem_cgroup/order
# show the mem_cgroup slab's objects per slab, it's 13
cat /sys/kernel/slab/mem_cgroup/objs_per_slab

We can also quickly calculate, for each slab page order, how many objects
it can store compared with the original kmalloc-4k allocation (the ORIGIN
column):

# mem_cgroup, 2368 bytes per object
| ORDER | SIZE | NUM_OBJS | ORIGIN |
| ----- | ---- | -------- | ------ |
|   3   | 32KB |    13    |    8   |
|   2   | 16KB |     6    |    4   |
|   1   |  8KB |     3    |    2   |
|   0   |  4KB |     1    |    1   |

# mem_cgroup_per_node, 2944 bytes per object
| ORDER | SIZE | NUM_OBJS | ORIGIN |
| ----- | ---- | -------- | ------ |
|   3   | 32KB |    11    |    8   |
|   2   | 16KB |     5    |    4   |
|   1   |  8KB |     2    |    2   |
|   0   |  4KB |     1    |    1   |

So for mem_cgroup, any page order >= 1 already gives an improvement, while
mem_cgroup_per_node needs at least order 2. :)

> So there is a packing advantage to turning off HWCACHE_ALIGN (for the
> first slab; no difference for the second).  BUT!  Now you have cacheline
> aliasing between two objects, and that's probably bad.  It's the kind
> of performance problem that's really hard to see.

Yes. And I would like to ask: in what situations do you think dropping
HWCACHE_ALIGN might cause issues? Could it be direct memory reclaim by
multiple processes? Or multiple processes charging memory simultaneously?

> Anyway, you've gone from allocating 8 objects per 32KiB to allocating
> 13 objects per 32KiB, a 62% improvement in memory consumption.

Thanks, that's much clearer.

Huan