Right now in oom_kill_process, if the OOM is because of a cgroup limit, we won't get memory allocation information. In some cases, we can have a large cgroup workload running which dominates the machine; the reason for using a cgroup is to leave some resources for the system. When this cgroup is killed, we would also like to have some memory allocation information for the whole server as well. This is the reason behind this mini change. Is it an acceptable thing to do? Will it be too much information for people? I am happy with any suggestions!

Yueyang Pan (1):
  Add memory allocation info for cgroup oom

 mm/oom_kill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
2.47.3
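For reference, a minimal sketch of the kind of change the cover letter describes, against dump_header() in mm/oom_kill.c; this is an illustration, not the posted patch, and the exact helper names and signatures vary by kernel version:

static void dump_header(struct oom_control *oc)
{
	/* ... existing header: invoking task, oom_score_adj, stack ... */
	if (is_memcg_oom(oc)) {
		mem_cgroup_print_oom_meminfo(oc->memcg);
		/*
		 * Sketched addition: also dump the system-wide memory
		 * state, which includes the memory allocation profile
		 * when CONFIG_MEM_ALLOC_PROFILING is enabled.
		 */
		show_mem();
	} else {
		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask,
			   gfp_zone(oc->gfp_mask));
	}
	/* ... */
}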
On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > Right now in the oom_kill_process if the oom is because of the cgroup > limit, we won't get memory allocation infomation. In some cases, we > can have a large cgroup workload running which dominates the machine. > The reason using cgroup is to leave some resource for system. When this > cgroup is killed, we would also like to have some memory allocation > information for the whole server as well. This is reason behind this > mini change. Is it an acceptable thing to do? Will it be too much > information for people? I am happy with any suggestions! For a single patch, it is better to have all the context in the patch and there is no need for a cover letter. What exact information do you want on the memcg oom that would be helpful to users in general? You mentioned memory allocation information; can you please elaborate a bit more?
On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > Right now in the oom_kill_process if the oom is because of the cgroup > > limit, we won't get memory allocation infomation. In some cases, we > > can have a large cgroup workload running which dominates the machine. > > The reason using cgroup is to leave some resource for system. When this > > cgroup is killed, we would also like to have some memory allocation > > information for the whole server as well. This is reason behind this > > mini change. Is it an acceptable thing to do? Will it be too much > > information for people? I am happy with any suggestions! > > For a single patch, it is better to have all the context in the patch > and there is no need for cover letter. Thanks for your suggestion Shakeel! I will change this in the next version. > > What exact information you want on the memcg oom that will be helpful > for the users in general? You mentioned memory allocation information, > can you please elaborate a bit more. > As in my reply to Suren, I was thinking the system-wide memory usage info provided by show_free_pages and memory allocation profiling info can help us debug cgoom by comparing them with historical data. What is your take on this? Thanks, Pan
On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > limit, we won't get memory allocation infomation. In some cases, we > > > can have a large cgroup workload running which dominates the machine. > > > The reason using cgroup is to leave some resource for system. When this > > > cgroup is killed, we would also like to have some memory allocation > > > information for the whole server as well. This is reason behind this > > > mini change. Is it an acceptable thing to do? Will it be too much > > > information for people? I am happy with any suggestions! > > > > For a single patch, it is better to have all the context in the patch > > and there is no need for cover letter. > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > What exact information you want on the memcg oom that will be helpful > > for the users in general? You mentioned memory allocation information, > > can you please elaborate a bit more. > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > provided by show_free_pages and memory allocation profiling info can help > us debug cgoom by comparing them with historical data. What is your take on > this? > I am not really sure about show_free_areas(). More specifically how the historical data diff will be useful for a memcg oom. If you have a concrete example, please give one. For memory allocation profiling, is it possible to filter for the given memcg? Do we save memcg information in the memory allocation profiling?
On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > can have a large cgroup workload running which dominates the machine. > > > > The reason using cgroup is to leave some resource for system. When this > > > > cgroup is killed, we would also like to have some memory allocation > > > > information for the whole server as well. This is reason behind this > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > information for people? I am happy with any suggestions! > > > > > > For a single patch, it is better to have all the context in the patch > > > and there is no need for cover letter. > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > for the users in general? You mentioned memory allocation information, > > > can you please elaborate a bit more. > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > provided by show_free_pages and memory allocation profiling info can help > > us debug cgoom by comparing them with historical data. What is your take on > > this? > > > > I am not really sure about show_free_areas(). More specifically how the > historical data diff will be useful for a memcg oom. If you have a > concrete example, please give one. For memory allocation profiling, is > it possible to filter for the given memcg? Do we save memcg information > in the memory allocation profiling? Actually I was thinking about making memory profiling memcg-aware but it would be quite costly both from memory and performance points of view. Currently we have a per-cpu counter for each allocation in the kernel codebase. To make it work for each memcg we would have to add memcg dimension to the counters, so each counter becomes per-cpu plus per-memcg. I'll be thinking about possible optimizations since many of these counters will stay at 0 but any such optimization would come at a performance cost, which we tried to keep at the absolute minimum. I'm CC'ing Sourav and Pasha since they were also interested in making memory allocation profiling memcg-aware. Would Meta folks (Usama, Shakeel, Johannes) be interested in such enhancement as well? Would it be preferable to have such accounting for a specific memcg which we pre-select (less memory and performance overhead) or we need that for all memcgs as a generic feature? We have some options here but I want to understand what would be sufficient and add as little overhead as possible. Thanks, Suren.
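For orientation, the per-allocation-site counters Suren describes look like this today (per include/linux/alloc_tag.h); the second struct is a purely hypothetical sketch of the extra memcg dimension, and NR_MEMCG_IDS is an invented bound, not a real kernel constant:

/* Existing: one per-cpu pair of counters per allocation site. */
struct alloc_tag_counters {
	u64 bytes;	/* bytes currently allocated at this site */
	u64 calls;	/* live allocations from this site */
};

struct alloc_tag {
	struct codetag ct;
	struct alloc_tag_counters __percpu *counters;
};

/*
 * Hypothetical memcg-aware variant: the counters gain a memcg
 * dimension, so each site pays per-cpu times per-memcg storage even
 * though most (site, memcg) pairs would stay at zero.
 */
struct alloc_tag_memcg {
	struct codetag ct;
	struct alloc_tag_counters __percpu *counters[NR_MEMCG_IDS];
};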
On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote: > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > can have a large cgroup workload running which dominates the machine. > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > information for the whole server as well. This is reason behind this > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > information for people? I am happy with any suggestions! > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > and there is no need for cover letter. > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > for the users in general? You mentioned memory allocation information, > > > > can you please elaborate a bit more. > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > provided by show_free_pages and memory allocation profiling info can help > > > us debug cgoom by comparing them with historical data. What is your take on > > > this? > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > historical data diff will be useful for a memcg oom. If you have a > > concrete example, please give one. For memory allocation profiling, is > > it possible to filter for the given memcg? Do we save memcg information > > in the memory allocation profiling? > > Actually I was thinking about making memory profiling memcg-aware but > it would be quite costly both from memory and performance points of > view. Currently we have a per-cpu counter for each allocation in the > kernel codebase. To make it work for each memcg we would have to add > memcg dimension to the counters, so each counter becomes per-cpu plus > per-memcg. I'll be thinking about possible optimizations since many of > these counters will stay at 0 but any such optimization would come at > a performance cost, which we tried to keep at the absolute minimum. > > I'm CC'ing Sourav and Pasha since they were also interested in making > memory allocation profiling memcg-aware. Would Meta folks (Usama, > Shakeel, Johannes) be interested in such enhancement as well? Would it > be preferable to have such accounting for a specific memcg which we > pre-select (less memory and performance overhead) or we need that for > all memcgs as a generic feature? We have some options here but I want > to understand what would be sufficient and add as little overhead as > possible. Thanks Suren, yes, as already mentioned by Usama, Meta will be interested in memcg aware allocation profiling. I would say start simple and as little overhead as possible. More functionality can be added later when the need arises. Maybe the first useful addition is just adding how many allocations for a specific allocation site are memcg charged.
On Wed, Aug 27, 2025 at 2:15 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote: > > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > information for the whole server as well. This is reason behind this > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > and there is no need for cover letter. > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > for the users in general? You mentioned memory allocation information, > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > provided by show_free_pages and memory allocation profiling info can help > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > this? > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > historical data diff will be useful for a memcg oom. If you have a > > > concrete example, please give one. For memory allocation profiling, is > > > it possible to filter for the given memcg? Do we save memcg information > > > in the memory allocation profiling? > > > > Actually I was thinking about making memory profiling memcg-aware but > > it would be quite costly both from memory and performance points of > > view. Currently we have a per-cpu counter for each allocation in the > > kernel codebase. To make it work for each memcg we would have to add > > memcg dimension to the counters, so each counter becomes per-cpu plus > > per-memcg. I'll be thinking about possible optimizations since many of > > these counters will stay at 0 but any such optimization would come at > > a performance cost, which we tried to keep at the absolute minimum. > > > > I'm CC'ing Sourav and Pasha since they were also interested in making > > memory allocation profiling memcg-aware. Would Meta folks (Usama, > > Shakeel, Johannes) be interested in such enhancement as well? Would it > > be preferable to have such accounting for a specific memcg which we > > pre-select (less memory and performance overhead) or we need that for > > all memcgs as a generic feature? We have some options here but I want > > to understand what would be sufficient and add as little overhead as > > possible. > > Thanks Suren, yes, as already mentioned by Usama, Meta will be > interested in memcg aware allocation profiling. I would say start simple > and as little overhead as possible. More functionality can be added > later when the need arises. 
> Maybe the first useful addition is just adding how many allocations for a specific allocation site are memcg charged. Adding back Sourav, Pasha and Johannes, who got accidentally dropped in the replies. I looked a bit into adding memcg-awareness into memory allocation profiling and it's more complicated than I first thought (as usual). The main complication is that we need to add memcg_id or some other memcg identifier into codetag_ref. That's needed so that we can unaccount the correct memcg when we free an allocation - that's the usual function of the codetag_ref. Now, extending codetag_ref is not a problem by itself, but when we use mem_profiling_compressed mode, we store an index of the codetag instead of codetag_ref in the unused page flag bits. This is a useful optimization that avoids using page_ext and the overhead associated with it. So, full-blown memcg support seems problematic. What I think is easily doable is a filtering interface where we could select a specific memcg to be profiled, IOW we profile only allocations from a chosen memcg. Filtering can be done using an ioctl interface on /proc/allocinfo, which can be used for other things as well, like filtering non-zero allocations, returning per-NUMA-node information, etc. I see that DAMON uses similar memcg filtering (see damos_filter.memcg_id), so I can reuse some of that code for implementing this facility. From a high level, userspace will be able to select one memcg at a time to be profiled. At some later time the profiling information is gathered and another memcg can be selected, or filtering can be reset to profile all allocations from all memcgs. I expect the overhead for this kind of memcg filtering to be quite low. WDYT folks, would this be helpful and cover your usecases?
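Purely as an illustration of the proposed filtering interface, userspace might drive it roughly as below; the ioctl number, its argument type, and the reset-to-zero semantics are all invented placeholders, since none of this exists yet:

#include <fcntl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented placeholder ABI: select one memcg ID to profile;
 * writing 0 resets the filter so all memcgs are profiled. */
#define ALLOCINFO_SET_MEMCG_FILTER _IOW('A', 0x01, __u64)

int main(void)
{
	__u64 memcg_id = 42;	/* hypothetical ID of the memcg of interest */
	int fd = open("/proc/allocinfo", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Profile only allocations charged to this memcg ... */
	if (ioctl(fd, ALLOCINFO_SET_MEMCG_FILTER, &memcg_id))
		perror("ioctl");
	/* ... later, read the per-site counters, then reset the filter. */
	memcg_id = 0;
	ioctl(fd, ALLOCINFO_SET_MEMCG_FILTER, &memcg_id);
	close(fd);
	return 0;
}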
On 27/08/2025 03:32, Suren Baghdasaryan wrote: > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: >>> On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: >>>> On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: >>>>> Right now in the oom_kill_process if the oom is because of the cgroup >>>>> limit, we won't get memory allocation infomation. In some cases, we >>>>> can have a large cgroup workload running which dominates the machine. >>>>> The reason using cgroup is to leave some resource for system. When this >>>>> cgroup is killed, we would also like to have some memory allocation >>>>> information for the whole server as well. This is reason behind this >>>>> mini change. Is it an acceptable thing to do? Will it be too much >>>>> information for people? I am happy with any suggestions! >>>> >>>> For a single patch, it is better to have all the context in the patch >>>> and there is no need for cover letter. >>> >>> Thanks for your suggestion Shakeel! I will change this in the next version. >>> >>>> >>>> What exact information you want on the memcg oom that will be helpful >>>> for the users in general? You mentioned memory allocation information, >>>> can you please elaborate a bit more. >>>> >>> >>> As in my reply to Suren, I was thinking the system-wide memory usage info >>> provided by show_free_pages and memory allocation profiling info can help >>> us debug cgoom by comparing them with historical data. What is your take on >>> this? >>> >> >> I am not really sure about show_free_areas(). More specifically how the >> historical data diff will be useful for a memcg oom. If you have a >> concrete example, please give one. For memory allocation profiling, is >> it possible to filter for the given memcg? Do we save memcg information >> in the memory allocation profiling? > > Actually I was thinking about making memory profiling memcg-aware but > it would be quite costly both from memory and performance points of > view. Currently we have a per-cpu counter for each allocation in the > kernel codebase. To make it work for each memcg we would have to add > memcg dimension to the counters, so each counter becomes per-cpu plus > per-memcg. I'll be thinking about possible optimizations since many of > these counters will stay at 0 but any such optimization would come at > a performance cost, which we tried to keep at the absolute minimum. > > I'm CC'ing Sourav and Pasha since they were also interested in making > memory allocation profiling memcg-aware. Would Meta folks (Usama, > Shakeel, Johannes) be interested in such enhancement as well? Would it > be preferable to have such accounting for a specific memcg which we > pre-select (less memory and performance overhead) or we need that for > all memcgs as a generic feature? We have some options here but I want > to understand what would be sufficient and add as little overhead as > possible. Yes, having per-memcg counters is going to be extremely useful (we were thinking of having this as a future project to work on). For the Meta fleet in particular, we might have almost 100 memcgs running, but the number of memcgs running workloads is particularly small (usually less than 10). In the rest, you might have services that are responsible for telemetry, monitoring, security, etc. (for which we aren't really interested in the memory allocation profile).
So yes, it would be ideal to have the profile for just pre-selected memcgs, especially if it leads to lower memory and performance overhead. Having the memory allocation profile at memcg level is especially needed when we have multiple workloads stacked on the same host. Having it at host level in such a case makes the data less useful when we have OOMs and for workload analysis, as you don't know which workload is contributing how much. > Thanks, > Suren.
On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > can have a large cgroup workload running which dominates the machine. > > > > The reason using cgroup is to leave some resource for system. When this > > > > cgroup is killed, we would also like to have some memory allocation > > > > information for the whole server as well. This is reason behind this > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > information for people? I am happy with any suggestions! > > > > > > For a single patch, it is better to have all the context in the patch > > > and there is no need for cover letter. > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > for the users in general? You mentioned memory allocation information, > > > can you please elaborate a bit more. > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > provided by show_free_pages and memory allocation profiling info can help > > us debug cgoom by comparing them with historical data. What is your take on > > this? > > > > I am not really sure about show_free_areas(). More specifically how the > historical data diff will be useful for a memcg oom. If you have a > concrete example, please give one. For memory allocation profiling, is Sorry for my late reply. I have been trying hard to think about a use case. One specific case I can think about is when there is no workload stacking, i.e. when one job is running solely on the machine. For example, memory allocation profiling can tell us the memory usage of the network driver, which can make it harder for the cgroup to allocate memory and eventually lead to cgoom. Without this information, it would be hard to reason about what is happening in the kernel given an increased number of ooms. show_free_areas() will give a summary of the different types of memory which can possibly lead to increased cgoom in my previous case. Then one looks deeper via memory allocation profiling as an entry point to debug. Does this make sense to you? > it possible to filter for the given memcg? Do we save memcg information > in the memory allocation profiling? Thanks, Pan
On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > can have a large cgroup workload running which dominates the machine. > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > information for the whole server as well. This is reason behind this > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > information for people? I am happy with any suggestions! > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > and there is no need for cover letter. > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > for the users in general? You mentioned memory allocation information, > > > > can you please elaborate a bit more. > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > provided by show_free_pages and memory allocation profiling info can help > > > us debug cgoom by comparing them with historical data. What is your take on > > > this? > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > historical data diff will be useful for a memcg oom. If you have a > > concrete example, please give one. For memory allocation profiling, is > > Sorry for my late reply. I have been trying hard to think about a use case. > One specific case I can think about is when there is no workload stacking, > when one job is running solely on the machine. For example, memory allocation > profiling can tell the memory usage of the network driver, which can make > cg allocates memory harder and eventually leads to cgoom. Without this > information, it would be hard to reason about what is happening in the kernel > given increased oom number. > > show_free_areas() will give a summary of different types of memory which > can possibably lead to increased cgoom in my previous case. Then one looks > deeper via the memory allocation profiling as an entrypoint to debug. > > Does this make sense to you? I think if we had per-memcg memory profiling that would make sense. Counters would reflect only allocations made by the processes from that memcg and you could easily identify the allocation that caused memcg to oom. But dumping system-wide profiling information at memcg-oom time I think would not help you with this task. It will be polluted with allocations from other memcgs, so likely won't help much (unless there is some obvious leak or you know that a specific allocation is done only by a process from your memcg and no other process). > > > it possible to filter for the given memcg? Do we save memcg information > > in the memory allocation profiling? > > Thanks > Pan
On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote: > On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > information for the whole server as well. This is reason behind this > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > and there is no need for cover letter. > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > for the users in general? You mentioned memory allocation information, > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > provided by show_free_pages and memory allocation profiling info can help > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > this? > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > historical data diff will be useful for a memcg oom. If you have a > > > concrete example, please give one. For memory allocation profiling, is > > > > Sorry for my late reply. I have been trying hard to think about a use case. > > One specific case I can think about is when there is no workload stacking, > > when one job is running solely on the machine. For example, memory allocation > > profiling can tell the memory usage of the network driver, which can make > > cg allocates memory harder and eventually leads to cgoom. Without this > > information, it would be hard to reason about what is happening in the kernel > > given increased oom number. > > > > show_free_areas() will give a summary of different types of memory which > > can possibably lead to increased cgoom in my previous case. Then one looks > > deeper via the memory allocation profiling as an entrypoint to debug. > > > > Does this make sense to you? > > I think if we had per-memcg memory profiling that would make sense. > Counters would reflect only allocations made by the processes from > that memcg and you could easily identify the allocation that caused > memcg to oom. But dumping system-wide profiling information at > memcg-oom time I think would not help you with this task. It will be > polluted with allocations from other memcgs, so likely won't help much > (unless there is some obvious leak or you know that a specific > allocation is done only by a process from your memcg and no other > process). I agree with Suren. It makes very little sense and in many cases it could be actively misleading to print global memory state on memcg OOMs. 
Not to mention that those events, unlike global OOMs, could happen much more often. If you are interested in more information on memcg oom occurrences, you can detect OOM events and print whatever information you need. -- Michal Hocko SUSE Labs
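Michal's suggestion can already be acted on with existing cgroup v2 interfaces: memory.events generates a file-modified notification whenever a counter such as oom_kill increments, so a userspace monitor can watch it and dump whatever state it needs. A rough sketch (the cgroup path is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/workload/memory.events";
	char buf[4096];
	int ifd = inotify_init();

	if (ifd < 0 || inotify_add_watch(ifd, path, IN_MODIFY) < 0) {
		perror("inotify");
		return 1;
	}
	for (;;) {
		/* blocks until memory.events changes, e.g. an oom_kill bump */
		if (read(ifd, buf, sizeof(buf)) < 0)
			break;
		int fd = open(path, O_RDONLY);
		ssize_t n = fd < 0 ? -1 : read(fd, buf, sizeof(buf) - 1);

		if (n > 0) {
			buf[n] = '\0';
			/* dump whatever you need here: these counters,
			 * /proc/allocinfo, /proc/meminfo, ... */
			fputs(buf, stdout);
		}
		if (fd >= 0)
			close(fd);
	}
	close(ifd);
	return 0;
}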
On Fri, Aug 29, 2025 at 08:35:08AM +0200, Michal Hocko wrote: > On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote: > > On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > > > > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > > information for the whole server as well. This is reason behind this > > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > > and there is no need for cover letter. > > > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > > for the users in general? You mentioned memory allocation information, > > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > > provided by show_free_pages and memory allocation profiling info can help > > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > > this? > > > > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > > historical data diff will be useful for a memcg oom. If you have a > > > > concrete example, please give one. For memory allocation profiling, is > > > > > > Sorry for my late reply. I have been trying hard to think about a use case. > > > One specific case I can think about is when there is no workload stacking, > > > when one job is running solely on the machine. For example, memory allocation > > > profiling can tell the memory usage of the network driver, which can make > > > cg allocates memory harder and eventually leads to cgoom. Without this > > > information, it would be hard to reason about what is happening in the kernel > > > given increased oom number. > > > > > > show_free_areas() will give a summary of different types of memory which > > > can possibably lead to increased cgoom in my previous case. Then one looks > > > deeper via the memory allocation profiling as an entrypoint to debug. > > > > > > Does this make sense to you? > > > > I think if we had per-memcg memory profiling that would make sense. > > Counters would reflect only allocations made by the processes from > > that memcg and you could easily identify the allocation that caused > > memcg to oom. But dumping system-wide profiling information at > > memcg-oom time I think would not help you with this task. It will be > > polluted with allocations from other memcgs, so likely won't help much > > (unless there is some obvious leak or you know that a specific > > allocation is done only by a process from your memcg and no other > > process). 
> > I agree with Suren. It makes very little sense and in many cases it > could be actively misleading to print global memory state on memcg OOMs. > Not to mention that those events, unlike global OOMs, could happen much > more often. > If you are interested in more information on memcg oom occurrences, you > can detect OOM events and print whatever information you need. "Misleading" is a concern; the show_mem report would want to print very explicitly which information is specifically for the memcg and which is global, and we don't do that now. I don't think that means we shouldn't print it at all, though, because it can happen that we're in an OOM because one specific codepath is allocating way more memory than it should be; even if the memory allocation profiling info isn't correct for the memcg, it'll be useful information in a situation like that, it just needs to very clearly state what it's reporting on. I'm not sure we do that very well at all now; I'm looking at __show_mem() and it's not even passed a memcg. !? Also, if anyone's thinking about "what if memory allocation profiling was memcg aware" - the thing we saw when doing performance testing is that memcg accounting was much higher overhead than memory allocation profiling - hence, most kernel memory allocations don't even get memcg accounting. I think that got the memcg people looking at ways to make the accounting cheaper, but I'm not sure if anything landed from that.
On Mon 08-09-25 13:34:17, Kent Overstreet wrote: [...] > I'm not sure we do that very well at all now; I'm looking at > __show_mem() and it's not even passed a memcg. !? __show_mem is not called from the memcg oom path. There is mem_cgroup_print_oom_meminfo for that purpose, as memcg stats are sufficiently different to have their own thing. Now, what mem_cgroup_print_oom_meminfo prints is missing a more detailed slab memory usage breakdown. This might be useful in some situations. If we had memcg-aware memory profiling then it could be added to that path as well. -- Michal Hocko SUSE Labs
On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Fri, Aug 29, 2025 at 08:35:08AM +0200, Michal Hocko wrote: > > On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote: > > > On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > > > > > > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > > > information for the whole server as well. This is reason behind this > > > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > > > and there is no need for cover letter. > > > > > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > > > for the users in general? You mentioned memory allocation information, > > > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > > > provided by show_free_pages and memory allocation profiling info can help > > > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > > > this? > > > > > > > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > > > historical data diff will be useful for a memcg oom. If you have a > > > > > concrete example, please give one. For memory allocation profiling, is > > > > > > > > Sorry for my late reply. I have been trying hard to think about a use case. > > > > One specific case I can think about is when there is no workload stacking, > > > > when one job is running solely on the machine. For example, memory allocation > > > > profiling can tell the memory usage of the network driver, which can make > > > > cg allocates memory harder and eventually leads to cgoom. Without this > > > > information, it would be hard to reason about what is happening in the kernel > > > > given increased oom number. > > > > > > > > show_free_areas() will give a summary of different types of memory which > > > > can possibably lead to increased cgoom in my previous case. Then one looks > > > > deeper via the memory allocation profiling as an entrypoint to debug. > > > > > > > > Does this make sense to you? > > > > > > I think if we had per-memcg memory profiling that would make sense. > > > Counters would reflect only allocations made by the processes from > > > that memcg and you could easily identify the allocation that caused > > > memcg to oom. But dumping system-wide profiling information at > > > memcg-oom time I think would not help you with this task. 
> > > It will be polluted with allocations from other memcgs, so likely won't help much > > > (unless there is some obvious leak or you know that a specific > > > allocation is done only by a process from your memcg and no other > > > process). > > > > I agree with Suren. It makes very little sense and in many cases it > > could be actively misleading to print global memory state on memcg OOMs. > > Not to mention that those events, unlike global OOMs, could happen much > > more often. > > If you are interested in more information on memcg oom occurrences, you > > can detect OOM events and print whatever information you need. > > "Misleading" is a concern; the show_mem report would want to print very > explicitly which information is specifically for the memcg and which is > global, and we don't do that now. > > I don't think that means we shouldn't print it at all though, because it > can happen that we're in an OOM because one specific codepath is > allocating way more memory than we should be; even if the memory > allocation profiling info isn't correct for the memcg it'll be useful > information in a situation like that, it just needs to very clearly > state what it's reporting on. > > I'm not sure we do that very well at all now, I'm looking at > __show_mem() and it's not even passed a memcg. !? > > Also, if anyone's thinking about "what if memory allocation profiling > was memcg aware" - the thing we saw when doing performance testing is > that memcg accounting was much higher overhead than memory allocation > profiling - hence, most kernel memory allocations don't even get memcg > accounting. > > I think that got the memcg people looking at ways to make the accounting > cheaper, but I'm not sure if anything landed from that. Yes, Roman landed a series of changes reducing the memcg accounting overhead.
On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > I think that got the memcg people looking at ways to make the accounting > > cheaper, but I'm not sure if anything landed from that. > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. Do you know offhand how big that was?
On Mon, Sep 8, 2025 at 10:49 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > I think that got the memcg people looking at ways to make the accounting > > > cheaper, but I'm not sure if anything landed from that. > > > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. > > Do you know offhand how big that was? I'll need to dig it up but it was still much higher than memory profiling.
On Mon, Sep 08, 2025 at 10:51:30AM -0700, Suren Baghdasaryan wrote: > On Mon, Sep 8, 2025 at 10:49 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > > > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > I think that got the memcg people looking at ways to make the accounting > > > > cheaper, but I'm not sure if anything landed from that. > > > > > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. > > > > Do you know offhand how big that was? > > I'll need to dig it up but it was still much higher than memory profiling. What benchmark/workload was used to compare memcg accounting and memory profiling?
On Mon, Sep 8, 2025 at 12:08 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Sep 08, 2025 at 10:51:30AM -0700, Suren Baghdasaryan wrote: > > On Mon, Sep 8, 2025 at 10:49 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > > > > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > I think that got the memcg people looking at ways to make the accounting > > > > > cheaper, but I'm not sure if anything landed from that. > > > > > > > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. > > > > > > Do you know offhand how big that was? > > > > I'll need to dig it up but it was still much higher than memory profiling. > > What benchmark/workload was used to compare memcg accounting and memory > profiling? It was an in-kernel allocation stress test. Not very realistic but good for comparing the overhead.
On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > can have a large cgroup workload running which dominates the machine. > > > > The reason using cgroup is to leave some resource for system. When this > > > > cgroup is killed, we would also like to have some memory allocation > > > > information for the whole server as well. This is reason behind this > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > information for people? I am happy with any suggestions! > > > > > > For a single patch, it is better to have all the context in the patch > > > and there is no need for cover letter. > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > for the users in general? You mentioned memory allocation information, > > > can you please elaborate a bit more. > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > provided by show_free_pages and memory allocation profiling info can help > > us debug cgoom by comparing them with historical data. What is your take on > > this? > > > > I am not really sure about show_free_areas(). More specifically how the > historical data diff will be useful for a memcg oom. If you have a > concrete example, please give one. For memory allocation profiling, is > it possible to filter for the given memcg? Do we save memcg information > in the memory allocation profiling? No, memory allocation profiling is not cgroup-aware. It tracks allocations and their code locations but no other context.
On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote: > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > can have a large cgroup workload running which dominates the machine. > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > information for the whole server as well. This is reason behind this > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > information for people? I am happy with any suggestions! > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > and there is no need for cover letter. > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > for the users in general? You mentioned memory allocation information, > > > > can you please elaborate a bit more. > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > provided by show_free_pages and memory allocation profiling info can help > > > us debug cgoom by comparing them with historical data. What is your take on > > > this? > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > historical data diff will be useful for a memcg oom. If you have a > > concrete example, please give one. For memory allocation profiling, is > > it possible to filter for the given memcg? Do we save memcg information > > in the memory allocation profiling? > > No, memory allocation profiling is not cgroup-aware. It tracks > allocations and their code locations but no other context. Thanks for the info. Pan, will having memcg info along with allocation profile help your use-case? (Though adding that might not be easy or cheaper)
On Thu, Aug 21, 2025 at 02:26:42PM -0700, Shakeel Butt wrote: > On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote: > > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > information for the whole server as well. This is reason behind this > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > and there is no need for cover letter. > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > for the users in general? You mentioned memory allocation information, > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > provided by show_free_pages and memory allocation profiling info can help > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > this? > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > historical data diff will be useful for a memcg oom. If you have a > > > concrete example, please give one. For memory allocation profiling, is > > > it possible to filter for the given memcg? Do we save memcg information > > > in the memory allocation profiling? > > > > No, memory allocation profiling is not cgroup-aware. It tracks > > allocations and their code locations but no other context. > > Thanks for the info. Pan, will having memcg info along with allocation > profile help your use-case? (Though adding that might not be easy or > cheaper) Yeah, I have been thinking about it with eBPF hooks, but it is going to be a long-term effort as we need to measure the overhead. Right now, the way memory profiling is implemented incurs almost "zero" overhead.