[PATCH v2 0/2] memcg: reading memcg stats more efficiently
Posted by JP Kobryn 2 months ago
When reading cgroup memory.stat files there is significant kernel overhead
in formatting and encoding numeric data into a string buffer. Beyond that,
the user mode program must then decode this data and possibly filter it to
obtain the desired stats. This process can be expensive for programs that
periodically sample this data across a large enough fleet.

As an alternative to reading memory.stat, introduce new kfuncs that allow
fetching specific memcg stats from within cgroup-iterator-based bpf
programs. This approach allows numeric values to be transferred directly
from the kernel to user mode via the mapped memory of the bpf program's
ELF data section. Reading stats this way effectively eliminates the
numeric conversion work otherwise performed in both kernel and user mode.
It also eliminates the need for filtering in a user mode program: where
reading memory.stat returns all stats, this new approach returns only the
selected stats.
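
As a rough illustration of the intended flow (a sketch only: the kfunc name
memcg_node_stat_fetch appears in the perf data below, but its exact
signature, the memcg lookup, and the section layout here are assumptions,
not taken from the patch itself):

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

/* user mode reads these raw values through the mapped .data section */
struct {
	__u64 anon;
	__u64 file;
} stats SEC(".data.stats");

/* hypothetical kfunc signature */
extern __u64 memcg_node_stat_fetch(struct mem_cgroup *memcg, int item) __ksym;

SEC("iter/cgroup")
int query(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;
	struct mem_cgroup *memcg;

	if (!cgrp)
		return 0;

	/* hypothetical: resolve the memcg from the cgroup's memory css */
	memcg = (struct mem_cgroup *)cgrp->subsys[memory_cgrp_id];

	/* no string formatting anywhere: raw counters go straight out */
	stats.anon = memcg_node_stat_fetch(memcg, NR_ANON_MAPPED);
	stats.file = memcg_node_stat_fetch(memcg, NR_FILE_PAGES);
	return 0;
}
```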

An experiment was set up to compare the performance of a program using
these new kfuncs against a program using the traditional method of reading
memory.stat. On the experimental side, a libbpf-based program was written
which sets up a link to the bpf program once in advance and then reuses
this link to create and read from a bpf iterator program for 1M iterations.
Meanwhile on the control side, a program was written to open the root
memory.stat file and read 1M times from the associated file descriptor
(seeking back to zero before each subsequent read). Note that the control
program does not decode or filter any data in user mode, because the
experimental program removes the need for this work entirely.

The results showed a significant perf benefit on the experimental side,
which reduced elapsed kernel-mode time by roughly 80% relative to the
control. The kernel overhead of numeric conversion on the control side is
eliminated on the experimental side, since the values are read directly
through the mapped memory of the bpf program. The experiment data is shown
here:

control: elapsed time
real    0m13.062s
user    0m0.147s
sys     0m12.876s

experiment: elapsed time
real    0m2.717s
user    0m0.175s
sys     0m2.451s

control: perf data
22.23% a.out [kernel.kallsyms] [k] vsnprintf
18.83% a.out [kernel.kallsyms] [k] format_decode
12.05% a.out [kernel.kallsyms] [k] string
11.56% a.out [kernel.kallsyms] [k] number
 7.71% a.out [kernel.kallsyms] [k] strlen
 4.80% a.out [kernel.kallsyms] [k] memcpy_orig
 4.67% a.out [kernel.kallsyms] [k] memory_stat_format
 4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
 2.22% a.out [kernel.kallsyms] [k] widen_string
 1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
 0.95% a.out [kernel.kallsyms] [k] put_dec_full8
 0.69% a.out [kernel.kallsyms] [k] put_dec
 0.69% a.out [kernel.kallsyms] [k] memcpy

experiment: perf data
10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
 7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
 4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
 3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
 2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
 2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
 2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
 2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
 2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
 2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist

Aside from the perf gain, the kfunc/bpf approach provides flexibility in
how memcg data can be delivered to a user mode program. As seen in the
second patch, which contains the selftests, it is possible to use a struct
with selected memory stat fields; the data layout is entirely up to the
programmer.

JP Kobryn (2):
  memcg: introduce kfuncs for fetching memcg stats
  memcg: selftests for memcg stat kfuncs

 mm/memcontrol.c                               |  67 ++++
 .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
 .../bpf/prog_tests/cgroup_iter_memcg.c        | 294 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_iter_memcg.c   |  61 ++++
 4 files changed, 440 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c

-- 
2.47.3
Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
Posted by Shakeel Butt 2 months ago
Cc memcg maintainers.

On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
> When reading cgroup memory.stat files there is significant kernel overhead
> in the formatting and encoding of numeric data into a string buffer. Beyond
> that, the given user mode program must decode this data and possibly
> perform filtering to obtain the desired stats. This process can be
> expensive for programs that periodically sample this data over a large
> enough fleet.
> 
> As an alternative to reading memory.stat, introduce new kfuncs that allow
> fetching specific memcg stats from within cgroup iterator based bpf
> programs. This approach allows for numeric values to be transferred
> directly from the kernel to user mode via the mapped memory of the bpf
> program's elf data section. Reading stats this way effectively eliminates
> the numeric conversion work needed to be performed in both kernel and user
> mode. It also eliminates the need for filtering in a user mode program.
> i.e. where reading memory.stat returns all stats, this new approach allows
> returning only select stats.
> 
> An experiment was setup to compare the performance of a program using these
> new kfuncs vs a program that uses the traditional method of reading
> memory.stat. On the experimental side, a libbpf based program was written
> which sets up a link to the bpf program once in advance and then reuses
> this link to create and read from a bpf iterator program for 1M iterations.

I am getting a bit confused on the terminology. You mentioned libbpf
program, bpf program, link. Can you describe each of them? Think of
explaining this to someone with no bpf background.

(BTW Yonghong already explained to me these details but I wanted the
commit message to be self explanatory).

> Meanwhile on the control side, a program was written to open the root
> memory.stat file

How much activity was on the system? I imagine none because I don't see
flushing in the perf profile. This experiment focuses on the
non-flushing part of the memcg stats which is fine.

> and repeatedly read 1M times from the associated file
> descriptor (while seeking back to zero before each subsequent read). Note
> that the program does not bother to decode or filter any data in user mode.
> The reason for this is because the experimental program completely removes
> the need for this work.

Hmm in your experiment is the control program doing the decode and/or
filter or no? The last sentence in above para is confusing. Yes, the
experiment program does not need to do the parsing or decoding in
userspace but the control program needs to do that. If your control
program is not doing it then you are under-selling your work.

> 
> The results showed a significant perf benefit on the experimental side,
> outperforming the control side by a margin of 80% elapsed time in kernel
> mode. The kernel overhead of numeric conversion on the control side is
> eliminated on the experimental side since the values are read directly
> through mapped memory of the bpf program. The experiment data is shown
> here:
> 
> control: elapsed time
> real    0m13.062s
> user    0m0.147s
> sys     0m12.876s
> 
> experiment: elapsed time
> real    0m2.717s
> user    0m0.175s
> sys     0m2.451s

These numbers are really awesome.

> 
> control: perf data
> 22.23% a.out [kernel.kallsyms] [k] vsnprintf
> 18.83% a.out [kernel.kallsyms] [k] format_decode
> 12.05% a.out [kernel.kallsyms] [k] string
> 11.56% a.out [kernel.kallsyms] [k] number
>  7.71% a.out [kernel.kallsyms] [k] strlen
>  4.80% a.out [kernel.kallsyms] [k] memcpy_orig
>  4.67% a.out [kernel.kallsyms] [k] memory_stat_format
>  4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
>  2.22% a.out [kernel.kallsyms] [k] widen_string
>  1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
>  0.95% a.out [kernel.kallsyms] [k] put_dec_full8
>  0.69% a.out [kernel.kallsyms] [k] put_dec
>  0.69% a.out [kernel.kallsyms] [k] memcpy
> 
> experiment: perf data
> 10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
>  7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
>  4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
>  3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
>  2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
>  2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
>  2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
>  2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
>  2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
>  2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
> 
> Aside from the perf gain, the kfunc/bpf approach provides flexibility in
> how memcg data can be delivered to a user mode program. As seen in the
> second patch which contains the selftests, it is possible to use a struct
> with select memory stat fields. But it is completely up to the programmer
> on how to lay out the data.

I remember you plan to convert a couple of open source programs to use this
new feature, I think below [1] and oomd [2]. Adding that information
would make your case stronger. cAdvisor [3] is another open source
tool that could benefit from this work.

[1] https://github.com/facebookincubator/below
[2] https://github.com/facebookincubator/oomd
[3] https://github.com/google/cadvisor
Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
Posted by JP Kobryn 2 months ago
On 10/15/25 1:46 PM, Shakeel Butt wrote:
> Cc memcg maintainers.
> 
> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>> When reading cgroup memory.stat files there is significant kernel overhead
>> in the formatting and encoding of numeric data into a string buffer. Beyond
>> that, the given user mode program must decode this data and possibly
>> perform filtering to obtain the desired stats. This process can be
>> expensive for programs that periodically sample this data over a large
>> enough fleet.
>>
>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>> fetching specific memcg stats from within cgroup iterator based bpf
>> programs. This approach allows for numeric values to be transferred
>> directly from the kernel to user mode via the mapped memory of the bpf
>> program's elf data section. Reading stats this way effectively eliminates
>> the numeric conversion work needed to be performed in both kernel and user
>> mode. It also eliminates the need for filtering in a user mode program.
>> i.e. where reading memory.stat returns all stats, this new approach allows
>> returning only select stats.
>>
>> An experiment was setup to compare the performance of a program using these
>> new kfuncs vs a program that uses the traditional method of reading
>> memory.stat. On the experimental side, a libbpf based program was written
>> which sets up a link to the bpf program once in advance and then reuses
>> this link to create and read from a bpf iterator program for 1M iterations.
> 
> I am getting a bit confused on the terminology. You mentioned libbpf
> program, bpf program, link. Can you describe each of them? Think of
> explaining this to someone with no bpf background.
> 
> (BTW Yonghong already explained to me these details but I wanted the
> commit message to be self explanatory).

No problem. I'll try to expand on those terms in v3.

> 
>> Meanwhile on the control side, a program was written to open the root
>> memory.stat file
> 
> How much activity was on the system? I imagine none because I don't see
> flushing in the perf profile. This experiment focuses on the
> non-flushing part of the memcg stats which is fine.

Right, at the time there was no custom workload running alongside the
tests.

> 
>> and repeatedly read 1M times from the associated file
>> descriptor (while seeking back to zero before each subsequent read). Note
>> that the program does not bother to decode or filter any data in user mode.
>> The reason for this is because the experimental program completely removes
>> the need for this work.
> 
> Hmm in your experiment is the control program doing the decode and/or
> filter or no? The last sentence in above para is confusing. Yes, the
> experiment program does not need to do the parsing or decoding in
> userspace but the control program needs to do that. If your control
> program is not doing it then you are under-selling your work.

The control does not perform decoding. But it's a good point. Let me add
decoding to the control side in v3.

> 
>>
>> The results showed a significant perf benefit on the experimental side,
>> outperforming the control side by a margin of 80% elapsed time in kernel
>> mode. The kernel overhead of numeric conversion on the control side is
>> eliminated on the experimental side since the values are read directly
>> through mapped memory of the bpf program. The experiment data is shown
>> here:
>>
>> control: elapsed time
>> real    0m13.062s
>> user    0m0.147s
>> sys     0m12.876s
>>
>> experiment: elapsed time
>> real    0m2.717s
>> user    0m0.175s
>> sys     0m2.451s
> 
> These numbers are really awesome.

:)

> 
>>
>> control: perf data
>> 22.23% a.out [kernel.kallsyms] [k] vsnprintf
>> 18.83% a.out [kernel.kallsyms] [k] format_decode
>> 12.05% a.out [kernel.kallsyms] [k] string
>> 11.56% a.out [kernel.kallsyms] [k] number
>>   7.71% a.out [kernel.kallsyms] [k] strlen
>>   4.80% a.out [kernel.kallsyms] [k] memcpy_orig
>>   4.67% a.out [kernel.kallsyms] [k] memory_stat_format
>>   4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
>>   2.22% a.out [kernel.kallsyms] [k] widen_string
>>   1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
>>   0.95% a.out [kernel.kallsyms] [k] put_dec_full8
>>   0.69% a.out [kernel.kallsyms] [k] put_dec
>>   0.69% a.out [kernel.kallsyms] [k] memcpy
>>
>> experiment: perf data
>> 10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
>>   7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
>>   4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
>>   3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
>>   2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
>>   2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
>>   2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
>>   2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
>>   2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
>>   2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
>>
>> Aside from the perf gain, the kfunc/bpf approach provides flexibility in
>> how memcg data can be delivered to a user mode program. As seen in the
>> second patch which contains the selftests, it is possible to use a struct
>> with select memory stat fields. But it is completely up to the programmer
>> on how to lay out the data.
> 
> I remember you plan to convert couple of open source program to use this
> new feature. I think below [1] and oomd [2]. Adding that information
> would further make your case strong. cAdvisor[3] is another open source
> tool which can take benefit from this work.

That is accurate, thanks. Will include in v3.

> 
> [1] https://github.com/facebookincubator/below
> [2] https://github.com/facebookincubator/oomd
> [3] https://github.com/google/cadvisor
>
Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
Posted by Roman Gushchin 2 months ago
JP Kobryn <inwardvessel@gmail.com> writes:

> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>> Cc memcg maintainers.
>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>> When reading cgroup memory.stat files there is significant kernel overhead
>>> in the formatting and encoding of numeric data into a string buffer. Beyond
>>> that, the given user mode program must decode this data and possibly
>>> perform filtering to obtain the desired stats. This process can be
>>> expensive for programs that periodically sample this data over a large
>>> enough fleet.
>>>
>>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>>> fetching specific memcg stats from within cgroup iterator based bpf
>>> programs. This approach allows for numeric values to be transferred
>>> directly from the kernel to user mode via the mapped memory of the bpf
>>> program's elf data section. Reading stats this way effectively eliminates
>>> the numeric conversion work needed to be performed in both kernel and user
>>> mode. It also eliminates the need for filtering in a user mode program.
>>> i.e. where reading memory.stat returns all stats, this new approach allows
>>> returning only select stats.

It seems like I have most of these functions implemented as part of
bpfoom: https://lkml.org/lkml/2025/8/18/1403

So I definitely find them useful. It would be nice to merge our efforts.

Thanks!
Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
Posted by JP Kobryn 2 months ago
On 10/15/25 6:10 PM, Roman Gushchin wrote:
> JP Kobryn <inwardvessel@gmail.com> writes:
> 
>> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>>> Cc memcg maintainers.
>>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>>> When reading cgroup memory.stat files there is significant kernel overhead
>>>> in the formatting and encoding of numeric data into a string buffer. Beyond
>>>> that, the given user mode program must decode this data and possibly
>>>> perform filtering to obtain the desired stats. This process can be
>>>> expensive for programs that periodically sample this data over a large
>>>> enough fleet.
>>>>
>>>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>>>> fetching specific memcg stats from within cgroup iterator based bpf
>>>> programs. This approach allows for numeric values to be transferred
>>>> directly from the kernel to user mode via the mapped memory of the bpf
>>>> program's elf data section. Reading stats this way effectively eliminates
>>>> the numeric conversion work needed to be performed in both kernel and user
>>>> mode. It also eliminates the need for filtering in a user mode program.
>>>> i.e. where reading memory.stat returns all stats, this new approach allows
>>>> returning only select stats.
> 
> It seems like I've most of these functions implemented as part of
> bpfoom: https://lkml.org/lkml/2025/8/18/1403
> 
> So I definitely find them useful. Would be nice to merge our efforts.

Sounds great. I see in your series that you allow the kfuncs to accept
integers as item numbers. Would my approach of using typed enums work
for you? I wanted to take advantage of libbpf CO-RE so that the bpf
program could gracefully handle cases where a given enumerator is not
present in a given kernel version. I made use of this in the selftests.
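
The CO-RE guard described here might look roughly like the fragment below
inside the bpf program (a sketch: the enum and the kfunc signature are
assumptions, bpf_core_enum_value_exists() is the standard libbpf macro):

```c
#include <vmlinux.h>
#include <bpf/bpf_core_read.h>

	/* skip stats whose enumerator is absent from this kernel's BTF */
	if (bpf_core_enum_value_exists(enum node_stat_item, NR_ANON_MAPPED))
		stats.anon = memcg_node_stat_fetch(memcg, NR_ANON_MAPPED);
	else
		stats.anon = 0;	/* stat not present on this kernel */
```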

I'm planning on sending out a v3, so let me know if you would like to see
any alterations that would align with bpfoom.
Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
Posted by Roman Gushchin 2 months ago
JP Kobryn <inwardvessel@gmail.com> writes:

> On 10/15/25 6:10 PM, Roman Gushchin wrote:
>> JP Kobryn <inwardvessel@gmail.com> writes:
>> 
>>> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>>>> Cc memcg maintainers.
>>>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>>>> When reading cgroup memory.stat files there is significant kernel overhead
>>>>> in the formatting and encoding of numeric data into a string buffer. Beyond
>>>>> that, the given user mode program must decode this data and possibly
>>>>> perform filtering to obtain the desired stats. This process can be
>>>>> expensive for programs that periodically sample this data over a large
>>>>> enough fleet.
>>>>>
>>>>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>>>>> fetching specific memcg stats from within cgroup iterator based bpf
>>>>> programs. This approach allows for numeric values to be transferred
>>>>> directly from the kernel to user mode via the mapped memory of the bpf
>>>>> program's elf data section. Reading stats this way effectively eliminates
>>>>> the numeric conversion work needed to be performed in both kernel and user
>>>>> mode. It also eliminates the need for filtering in a user mode program.
>>>>> i.e. where reading memory.stat returns all stats, this new approach allows
>>>>> returning only select stats.
>> It seems like I've most of these functions implemented as part of
>> bpfoom: https://lkml.org/lkml/2025/8/18/1403
>> So I definitely find them useful. Would be nice to merge our
>> efforts.
>
> Sounds great. I see in your series that you allow the kfuncs to accept
> integers as item numbers. Would my approach of using typed enums work
> for you? I wanted to take advantage of libbpf core so that the bpf
> program could gracefully handle cases where a given enumerator is not
> present in a given kernel version. I made use of this in the
> selftests.

Good point, I'm going to change it in the next version, which I'm about
to send out: tomorrow or early next week.

> I'm planning on sending out a v3 so let me know if you would like to see
> any alterations that would align with bpfoom.

I kinda prefer my version regarding taking a memcg argument instead of a
cgroup, and also regarding naming. I also think it's safer to expose the
rate-limited version of the stats flushing function. But I do lack the
node-level statistics (which I don't need).

If it's ok with you, maybe you can rebase your patches on top of my v2
and I can include your patches in the series?

Thanks!