When reading cgroup memory.stat files there is significant kernel overhead
in the formatting and encoding of numeric data into a string buffer. Beyond
that, the given user mode program must decode this data and possibly
perform filtering to obtain the desired stats. This process can be
expensive for programs that periodically sample this data over a large
enough fleet.

As an alternative to reading memory.stat, introduce new kfuncs that allow
fetching specific memcg stats from within cgroup iterator based bpf
programs. This approach allows for numeric values to be transferred
directly from the kernel to user mode via the mapped memory of the bpf
program's ELF data section. Reading stats this way effectively eliminates
the numeric conversion work needed to be performed in both kernel and user
mode. It also eliminates the need for filtering in a user mode program,
i.e. where reading memory.stat returns all stats, this new approach allows
returning only select stats.

An experiment was set up to compare the performance of a program using
these new kfuncs vs a program that uses the traditional method of reading
memory.stat. On the experimental side, a libbpf based program was written
which sets up a link to the bpf program once in advance and then reuses
this link to create and read from a bpf iterator program for 1M iterations.
Meanwhile on the control side, a program was written to open the root
memory.stat file and repeatedly read 1M times from the associated file
descriptor (while seeking back to zero before each subsequent read). Note
that the program does not bother to decode or filter any data in user mode.
The reason for this is because the experimental program completely removes
the need for this work.
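To make the mechanism concrete, below is a minimal sketch of what a cgroup
iterator based bpf program using such kfuncs could look like. The kfunc
names, signatures, and item identifiers here are placeholders rather than
the exact interface added by this series; the intent is only to show that
the program stores raw numeric values in its globals, which user mode then
reads directly through the mapped data section.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

/* Results land in the program's data section; user mode maps this and
 * reads the values directly, so there is no string formatting in the
 * kernel and no decoding or filtering in user mode.
 */
struct {
	__u64 anon;
	__u64 file;
	__u64 pgfault;
} result;

/* Illustrative item identifiers only; the real interface uses typed enums. */
enum { ITEM_ANON, ITEM_FILE, EVENT_PGFAULT };

/* Placeholder declarations standing in for the kfuncs added by this
 * series -- not their actual names or signatures.
 */
extern __u64 bpf_memcg_fetch_node_stat(struct cgroup *cgrp, int item) __ksym __weak;
extern __u64 bpf_memcg_fetch_vm_event(struct cgroup *cgrp, int event) __ksym __weak;

SEC("iter/cgroup")
int query(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;

	if (!cgrp)
		return 0;

	/* weak ksyms resolve to NULL if the running kernel lacks the kfuncs */
	if (!bpf_memcg_fetch_node_stat || !bpf_memcg_fetch_vm_event)
		return 0;

	result.anon = bpf_memcg_fetch_node_stat(cgrp, ITEM_ANON);
	result.file = bpf_memcg_fetch_node_stat(cgrp, ITEM_FILE);
	result.pgfault = bpf_memcg_fetch_vm_event(cgrp, EVENT_PGFAULT);

	return 0;
}

Contrast this with the memory.stat path, where memory_stat_format() renders
every stat into text on each read.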
The results showed a significant perf benefit on the experimental side,
outperforming the control side by a margin of 80% elapsed time in kernel
mode. The kernel overhead of numeric conversion on the control side is
eliminated on the experimental side since the values are read directly
through mapped memory of the bpf program. The experiment data is shown
here:

control: elapsed time
real    0m13.062s
user    0m0.147s
sys     0m12.876s

experiment: elapsed time
real    0m2.717s
user    0m0.175s
sys     0m2.451s

control: perf data
 22.23%  a.out  [kernel.kallsyms]  [k] vsnprintf
 18.83%  a.out  [kernel.kallsyms]  [k] format_decode
 12.05%  a.out  [kernel.kallsyms]  [k] string
 11.56%  a.out  [kernel.kallsyms]  [k] number
  7.71%  a.out  [kernel.kallsyms]  [k] strlen
  4.80%  a.out  [kernel.kallsyms]  [k] memcpy_orig
  4.67%  a.out  [kernel.kallsyms]  [k] memory_stat_format
  4.63%  a.out  [kernel.kallsyms]  [k] seq_buf_printf
  2.22%  a.out  [kernel.kallsyms]  [k] widen_string
  1.65%  a.out  [kernel.kallsyms]  [k] put_dec_trunc8
  0.95%  a.out  [kernel.kallsyms]  [k] put_dec_full8
  0.69%  a.out  [kernel.kallsyms]  [k] put_dec
  0.69%  a.out  [kernel.kallsyms]  [k] memcpy

experiment: perf data
 10.04%  memcgstat  bpf_prog_.._query  [k] bpf_prog_527781c811d5b45c_query
  7.85%  memcgstat  [kernel.kallsyms]  [k] memcg_node_stat_fetch
  4.03%  memcgstat  [kernel.kallsyms]  [k] __memcg_slab_post_alloc_hook
  3.47%  memcgstat  [kernel.kallsyms]  [k] _raw_spin_lock
  2.58%  memcgstat  [kernel.kallsyms]  [k] memcg_vm_event_fetch
  2.58%  memcgstat  [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
  2.32%  memcgstat  [kernel.kallsyms]  [k] kmem_cache_free
  2.19%  memcgstat  [kernel.kallsyms]  [k] __memcg_slab_free_hook
  2.13%  memcgstat  [kernel.kallsyms]  [k] mutex_lock
  2.12%  memcgstat  [kernel.kallsyms]  [k] get_page_from_freelist

Aside from the perf gain, the kfunc/bpf approach provides flexibility in
how memcg data can be delivered to a user mode program. As seen in the
second patch which contains the selftests, it is possible to use a struct
with select memory stat fields. But it is completely up to the programmer
on how to lay out the data.

JP Kobryn (2):
  memcg: introduce kfuncs for fetching memcg stats
  memcg: selftests for memcg stat kfuncs

 mm/memcontrol.c                               |  67 ++++
 .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
 .../bpf/prog_tests/cgroup_iter_memcg.c        | 294 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_iter_memcg.c   |  61 ++++
 4 files changed, 440 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c

-- 
2.47.3
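For reference, the experimental user mode loop described in the cover
letter can be structured roughly as follows using standard libbpf APIs. The
skeleton name and the result global mirror the simplified bpf program
sketched above and are illustrative rather than the code from the
selftests.

#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "cgroup_iter_memcg.skel.h"	/* assumed skeleton name */

int main(void)
{
	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};
	struct cgroup_iter_memcg *skel;
	struct bpf_link *link;
	int cgroup_fd, iter_fd;
	char buf[64];

	skel = cgroup_iter_memcg__open_and_load();
	if (!skel)
		return 1;

	/* parameterize the iterator with the target cgroup (root here) */
	cgroup_fd = open("/sys/fs/cgroup", O_RDONLY);
	if (cgroup_fd < 0)
		return 1;
	linfo.cgroup.cgroup_fd = cgroup_fd;
	linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	/* the link is set up once and reused for every iteration */
	link = bpf_program__attach_iter(skel->progs.query, &opts);
	if (!link)
		return 1;

	for (int i = 0; i < 1000000; i++) {
		iter_fd = bpf_iter_create(bpf_link__fd(link));
		if (iter_fd < 0)
			return 1;

		/* drive the iterator; the bpf program runs and fills the
		 * globals (no seq_file text is emitted by this program)
		 */
		while (read(iter_fd, buf, sizeof(buf)) > 0)
			;
		close(iter_fd);

		/* values are now available through mapped memory, e.g.
		 * skel->bss->result.anon, with no decoding required
		 */
	}

	bpf_link__destroy(link);
	cgroup_iter_memcg__destroy(skel);
	return 0;
}

The control program, by contrast, opens the root memory.stat file once and
loops over read() plus lseek(fd, 0, SEEK_SET), paying the kernel-side
formatting cost on every sample.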
Cc memcg maintainers.

On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
> When reading cgroup memory.stat files there is significant kernel overhead
> in the formatting and encoding of numeric data into a string buffer. Beyond
> that, the given user mode program must decode this data and possibly
> perform filtering to obtain the desired stats. This process can be
> expensive for programs that periodically sample this data over a large
> enough fleet.
>
> As an alternative to reading memory.stat, introduce new kfuncs that allow
> fetching specific memcg stats from within cgroup iterator based bpf
> programs. This approach allows for numeric values to be transferred
> directly from the kernel to user mode via the mapped memory of the bpf
> program's elf data section. Reading stats this way effectively eliminates
> the numeric conversion work needed to be performed in both kernel and user
> mode. It also eliminates the need for filtering in a user mode program.
> i.e. where reading memory.stat returns all stats, this new approach allows
> returning only select stats.
>
> An experiment was setup to compare the performance of a program using these
> new kfuncs vs a program that uses the traditional method of reading
> memory.stat. On the experimental side, a libbpf based program was written
> which sets up a link to the bpf program once in advance and then reuses
> this link to create and read from a bpf iterator program for 1M iterations.

I am getting a bit confused on the terminology. You mentioned libbpf
program, bpf program, link. Can you describe each of them? Think of
explaining this to someone with no bpf background.

(BTW Yonghong already explained to me these details but I wanted the
commit message to be self explanatory).

> Meanwhile on the control side, a program was written to open the root
> memory.stat file

How much activity was on the system? I imagine none because I don't see
flushing in the perf profile. This experiment focuses on the
non-flushing part of the memcg stats which is fine.

> and repeatedly read 1M times from the associated file
> descriptor (while seeking back to zero before each subsequent read). Note
> that the program does not bother to decode or filter any data in user mode.
> The reason for this is because the experimental program completely removes
> the need for this work.

Hmm in your experiment is the control program doing the decode and/or
filter or no? The last sentence in above para is confusing. Yes, the
experiment program does not need to do the parsing or decoding in
userspace but the control program needs to do that. If your control
program is not doing it then you are under-selling your work.

>
> The results showed a significant perf benefit on the experimental side,
> outperforming the control side by a margin of 80% elapsed time in kernel
> mode. The kernel overhead of numeric conversion on the control side is
> eliminated on the experimental side since the values are read directly
> through mapped memory of the bpf program. The experiment data is shown
> here:
>
> control: elapsed time
> real    0m13.062s
> user    0m0.147s
> sys     0m12.876s
>
> experiment: elapsed time
> real    0m2.717s
> user    0m0.175s
> sys     0m2.451s

These numbers are really awesome.
>
> control: perf data
>  22.23%  a.out  [kernel.kallsyms]  [k] vsnprintf
>  18.83%  a.out  [kernel.kallsyms]  [k] format_decode
>  12.05%  a.out  [kernel.kallsyms]  [k] string
>  11.56%  a.out  [kernel.kallsyms]  [k] number
>   7.71%  a.out  [kernel.kallsyms]  [k] strlen
>   4.80%  a.out  [kernel.kallsyms]  [k] memcpy_orig
>   4.67%  a.out  [kernel.kallsyms]  [k] memory_stat_format
>   4.63%  a.out  [kernel.kallsyms]  [k] seq_buf_printf
>   2.22%  a.out  [kernel.kallsyms]  [k] widen_string
>   1.65%  a.out  [kernel.kallsyms]  [k] put_dec_trunc8
>   0.95%  a.out  [kernel.kallsyms]  [k] put_dec_full8
>   0.69%  a.out  [kernel.kallsyms]  [k] put_dec
>   0.69%  a.out  [kernel.kallsyms]  [k] memcpy
>
> experiment: perf data
>  10.04%  memcgstat  bpf_prog_.._query  [k] bpf_prog_527781c811d5b45c_query
>   7.85%  memcgstat  [kernel.kallsyms]  [k] memcg_node_stat_fetch
>   4.03%  memcgstat  [kernel.kallsyms]  [k] __memcg_slab_post_alloc_hook
>   3.47%  memcgstat  [kernel.kallsyms]  [k] _raw_spin_lock
>   2.58%  memcgstat  [kernel.kallsyms]  [k] memcg_vm_event_fetch
>   2.58%  memcgstat  [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
>   2.32%  memcgstat  [kernel.kallsyms]  [k] kmem_cache_free
>   2.19%  memcgstat  [kernel.kallsyms]  [k] __memcg_slab_free_hook
>   2.13%  memcgstat  [kernel.kallsyms]  [k] mutex_lock
>   2.12%  memcgstat  [kernel.kallsyms]  [k] get_page_from_freelist
>
> Aside from the perf gain, the kfunc/bpf approach provides flexibility in
> how memcg data can be delivered to a user mode program. As seen in the
> second patch which contains the selftests, it is possible to use a struct
> with select memory stat fields. But it is completely up to the programmer
> on how to lay out the data.

I remember you plan to convert a couple of open source programs to use
this new feature. I think below [1] and oomd [2]. Adding that information
would further make your case strong. cAdvisor [3] is another open source
tool which can benefit from this work.

[1] https://github.com/facebookincubator/below
[2] https://github.com/facebookincubator/oomd
[3] https://github.com/google/cadvisor
On 10/15/25 1:46 PM, Shakeel Butt wrote:
> Cc memcg maintainers.
>
> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>> When reading cgroup memory.stat files there is significant kernel overhead
>> [...]
>>
>> An experiment was setup to compare the performance of a program using these
>> new kfuncs vs a program that uses the traditional method of reading
>> memory.stat. On the experimental side, a libbpf based program was written
>> which sets up a link to the bpf program once in advance and then reuses
>> this link to create and read from a bpf iterator program for 1M iterations.
>
> I am getting a bit confused on the terminology. You mentioned libbpf
> program, bpf program, link. Can you describe each of them? Think of
> explaining this to someone with no bpf background.
>
> (BTW Yonghong already explained to me these details but I wanted the
> commit message to be self explanatory).

No problem. I'll try to expand on those terms in v3.

>
>> Meanwhile on the control side, a program was written to open the root
>> memory.stat file
>
> How much activity was on the system? I imagine none because I don't see
> flushing in the perf profile. This experiment focuses on the
> non-flushing part of the memcg stats which is fine.

Right, at the time there was no custom workload running alongside the
tests.

>
>> and repeatedly read 1M times from the associated file
>> descriptor (while seeking back to zero before each subsequent read). Note
>> that the program does not bother to decode or filter any data in user mode.
>> The reason for this is because the experimental program completely removes
>> the need for this work.
>
> Hmm in your experiment is the control program doing the decode and/or
> filter or no? The last sentence in above para is confusing. Yes, the
> experiment program does not need to do the parsing or decoding in
> userspace but the control program needs to do that. If your control
> program is not doing it then you are under-selling your work.

The control does not perform decoding. But it's a good point. Let me add
decoding to the control side in v3.

>
>> The results showed a significant perf benefit on the experimental side,
>> outperforming the control side by a margin of 80% elapsed time in kernel
>> mode. The kernel overhead of numeric conversion on the control side is
>> eliminated on the experimental side since the values are read directly
>> through mapped memory of the bpf program.
>> The experiment data is shown here:
>>
>> control: elapsed time
>> real    0m13.062s
>> user    0m0.147s
>> sys     0m12.876s
>>
>> experiment: elapsed time
>> real    0m2.717s
>> user    0m0.175s
>> sys     0m2.451s
>
> These numbers are really awesome.

:)

>
>> control: perf data
>> [...]
>>
>> experiment: perf data
>> [...]
>>
>> Aside from the perf gain, the kfunc/bpf approach provides flexibility in
>> how memcg data can be delivered to a user mode program. As seen in the
>> second patch which contains the selftests, it is possible to use a struct
>> with select memory stat fields. But it is completely up to the programmer
>> on how to lay out the data.
>
> I remember you plan to convert a couple of open source programs to use
> this new feature. I think below [1] and oomd [2]. Adding that information
> would further make your case strong. cAdvisor [3] is another open source
> tool which can benefit from this work.

That is accurate, thanks. Will include in v3.

> [1] https://github.com/facebookincubator/below
> [2] https://github.com/facebookincubator/oomd
> [3] https://github.com/google/cadvisor
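For reference, the user mode decode step under discussion amounts to
parsing the flat "name value" lines that memory.stat emits. A minimal
sketch of what the control program could add, assuming the buffer read
from the file is NUL-terminated (the stat names checked here are just
examples):

#include <stdio.h>
#include <string.h>

/* Parse "name value" lines from a memory.stat buffer, keeping only the
 * entries of interest. This is the per-sample user mode work that the
 * kfunc-based approach avoids entirely.
 */
static void decode_memory_stat(const char *buf)
{
	const char *line = buf;

	while (line && *line) {
		char name[64];
		unsigned long long val;

		if (sscanf(line, "%63s %llu", name, &val) == 2 &&
		    (!strcmp(name, "anon") || !strcmp(name, "pgfault"))) {
			/* consume the selected stat, e.g. store or export it */
		}

		/* advance to the next line */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
}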
JP Kobryn <inwardvessel@gmail.com> writes:

> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>> Cc memcg maintainers.
>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>> When reading cgroup memory.stat files there is significant kernel overhead
>>> [...]
>>>
>>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>>> fetching specific memcg stats from within cgroup iterator based bpf
>>> programs. This approach allows for numeric values to be transferred
>>> directly from the kernel to user mode via the mapped memory of the bpf
>>> program's elf data section. Reading stats this way effectively eliminates
>>> the numeric conversion work needed to be performed in both kernel and user
>>> mode. It also eliminates the need for filtering in a user mode program.
>>> i.e. where reading memory.stat returns all stats, this new approach allows
>>> returning only select stats.

It seems like I've most of these functions implemented as part of
bpfoom: https://lkml.org/lkml/2025/8/18/1403

So I definitely find them useful. Would be nice to merge our efforts.

Thanks!
On 10/15/25 6:10 PM, Roman Gushchin wrote:
> JP Kobryn <inwardvessel@gmail.com> writes:
>
>> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>>> Cc memcg maintainers.
>>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>>> When reading cgroup memory.stat files there is significant kernel overhead
>>>> [...]
>>>> returning only select stats.
>
> It seems like I've most of these functions implemented as part of
> bpfoom: https://lkml.org/lkml/2025/8/18/1403
>
> So I definitely find them useful. Would be nice to merge our efforts.

Sounds great. I see in your series that you allow the kfuncs to accept
integers as item numbers. Would my approach of using typed enums work
for you? I wanted to take advantage of libbpf CO-RE so that the bpf
program could gracefully handle cases where a given enumerator is not
present in a given kernel version. I made use of this in the selftests.

I'm planning on sending out a v3 so let me know if you would like to see
any alterations that would align with bpfoom.
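The CO-RE pattern referred to here, roughly: the enum type and enumerator
are taken from vmlinux.h and resolved against the running kernel at load
time, so the program can skip stats whose enumerators do not exist instead
of failing to load. A minimal sketch, using an existing kernel enum purely
for illustration (the series defines its own typed enums):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char _license[] SEC("license") = "GPL";

SEC("iter/cgroup")
int query(struct bpf_iter__cgroup *ctx)
{
	int item;

	/* Only request the stat if the enumerator exists on the running
	 * kernel; on older kernels this branch is simply skipped.
	 */
	if (bpf_core_enum_value_exists(enum node_stat_item, NR_ANON_MAPPED)) {
		item = bpf_core_enum_value(enum node_stat_item, NR_ANON_MAPPED);
		bpf_printk("NR_ANON_MAPPED resolved to %d", item);
		/* ... pass 'item' to the stat-fetching kfunc ... */
	}

	return 0;
}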
JP Kobryn <inwardvessel@gmail.com> writes:

> On 10/15/25 6:10 PM, Roman Gushchin wrote:
>> JP Kobryn <inwardvessel@gmail.com> writes:
>>
>>> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>>>> Cc memcg maintainers.
>>>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>>>> When reading cgroup memory.stat files there is significant kernel overhead
>>>>> [...]
>>>>> returning only select stats.
>>
>> It seems like I've most of these functions implemented as part of
>> bpfoom: https://lkml.org/lkml/2025/8/18/1403
>>
>> So I definitely find them useful. Would be nice to merge our efforts.
>
> Sounds great. I see in your series that you allow the kfuncs to accept
> integers as item numbers. Would my approach of using typed enums work
> for you? I wanted to take advantage of libbpf CO-RE so that the bpf
> program could gracefully handle cases where a given enumerator is not
> present in a given kernel version. I made use of this in the selftests.

Good point, I'm going to change it in the next version, which I'm about
to send out: tomorrow or early next week.

> I'm planning on sending out a v3 so let me know if you would like to see
> any alterations that would align with bpfoom.

I kinda prefer my version regarding taking a memcg argument instead of a
cgroup, and also regarding naming. I also think it's safer to expose the
rate-limited version of the stats flushing function. But I do lack the
node-level statistics (which I don't need).

If it's ok with you, maybe you can rebase your patches on top of my v2
and I can include your patches in the series?

Thanks!