Right now in oom_kill_process, if the OOM is because of a cgroup limit, we won't get memory allocation information. In some cases, we can have a large cgroup workload running which dominates the machine; the reason for using a cgroup is to leave some resources for the system. When this cgroup is killed, we would also like to have some memory allocation information for the whole server as well. This is the reason behind this mini change. Is it an acceptable thing to do? Will it be too much information for people? I am happy with any suggestions!

Yueyang Pan (1):
  Add memory allocation info for cgroup oom

 mm/oom_kill.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
2.47.3
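For reference, a minimal sketch of the kind of change the cover letter describes, against dump_header() in mm/oom_kill.c; this is an illustration, not the posted patch, and the exact helper names and signatures vary by kernel version:

static void dump_header(struct oom_control *oc)
{
	/* ... existing header: invoking task, oom_score_adj, stack ... */
	if (is_memcg_oom(oc)) {
		mem_cgroup_print_oom_meminfo(oc->memcg);
		/*
		 * Sketched addition: also dump the system-wide memory
		 * state, which includes the memory allocation profile
		 * when CONFIG_MEM_ALLOC_PROFILING is enabled.
		 */
		show_mem();
	} else {
		__show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask,
			   gfp_zone(oc->gfp_mask));
	}
	/* ... */
}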
On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > Right now in the oom_kill_process if the oom is because of the cgroup > limit, we won't get memory allocation infomation. In some cases, we > can have a large cgroup workload running which dominates the machine. > The reason using cgroup is to leave some resource for system. When this > cgroup is killed, we would also like to have some memory allocation > information for the whole server as well. This is reason behind this > mini change. Is it an acceptable thing to do? Will it be too much > information for people? I am happy with any suggestions! For a single patch, it is better to have all the context in the patch and there is no need for a cover letter. What exact information do you want on the memcg oom that would be helpful to users in general? You mentioned memory allocation information; can you please elaborate a bit more?
On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > Right now in the oom_kill_process if the oom is because of the cgroup > > limit, we won't get memory allocation infomation. In some cases, we > > can have a large cgroup workload running which dominates the machine. > > The reason using cgroup is to leave some resource for system. When this > > cgroup is killed, we would also like to have some memory allocation > > information for the whole server as well. This is reason behind this > > mini change. Is it an acceptable thing to do? Will it be too much > > information for people? I am happy with any suggestions! > > For a single patch, it is better to have all the context in the patch > and there is no need for cover letter. Thanks for your suggestion Shakeel! I will change this in the next version. > > What exact information you want on the memcg oom that will be helpful > for the users in general? You mentioned memory allocation information, > can you please elaborate a bit more. > As in my reply to Suren, I was thinking the system-wide memory usage info provided by show_free_pages and memory allocation profiling info can help us debug cgoom by comparing them with historical data. What is your take on this? Thanks, Pan
On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > limit, we won't get memory allocation infomation. In some cases, we > > > can have a large cgroup workload running which dominates the machine. > > > The reason using cgroup is to leave some resource for system. When this > > > cgroup is killed, we would also like to have some memory allocation > > > information for the whole server as well. This is reason behind this > > > mini change. Is it an acceptable thing to do? Will it be too much > > > information for people? I am happy with any suggestions! > > > > For a single patch, it is better to have all the context in the patch > > and there is no need for cover letter. > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > What exact information you want on the memcg oom that will be helpful > > for the users in general? You mentioned memory allocation information, > > can you please elaborate a bit more. > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > provided by show_free_pages and memory allocation profiling info can help > us debug cgoom by comparing them with historical data. What is your take on > this? > I am not really sure about show_free_areas(). More specifically how the historical data diff will be useful for a memcg oom. If you have a concrete example, please give one. For memory allocation profiling, is it possible to filter for the given memcg? Do we save memcg information in the memory allocation profiling?
On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > can have a large cgroup workload running which dominates the machine. > > > > The reason using cgroup is to leave some resource for system. When this > > > > cgroup is killed, we would also like to have some memory allocation > > > > information for the whole server as well. This is reason behind this > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > information for people? I am happy with any suggestions! > > > > > > For a single patch, it is better to have all the context in the patch > > > and there is no need for cover letter. > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > for the users in general? You mentioned memory allocation information, > > > can you please elaborate a bit more. > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > provided by show_free_pages and memory allocation profiling info can help > > us debug cgoom by comparing them with historical data. What is your take on > > this? > > > > I am not really sure about show_free_areas(). More specifically how the > historical data diff will be useful for a memcg oom. If you have a > concrete example, please give one. For memory allocation profiling, is > it possible to filter for the given memcg? Do we save memcg information > in the memory allocation profiling? Actually I was thinking about making memory profiling memcg-aware but it would be quite costly both from memory and performance points of view. Currently we have a per-cpu counter for each allocation in the kernel codebase. To make it work for each memcg we would have to add memcg dimension to the counters, so each counter becomes per-cpu plus per-memcg. I'll be thinking about possible optimizations since many of these counters will stay at 0 but any such optimization would come at a performance cost, which we tried to keep at the absolute minimum. I'm CC'ing Sourav and Pasha since they were also interested in making memory allocation profiling memcg-aware. Would Meta folks (Usama, Shakeel, Johannes) be interested in such enhancement as well? Would it be preferable to have such accounting for a specific memcg which we pre-select (less memory and performance overhead) or we need that for all memcgs as a generic feature? We have some options here but I want to understand what would be sufficient and add as little overhead as possible. Thanks, Suren.
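For orientation, the per-allocation-site counters Suren describes look like this today (per include/linux/alloc_tag.h); the second struct is a purely hypothetical sketch of the extra memcg dimension, and NR_MEMCG_IDS is an invented bound, not a real kernel constant:

/* Existing: one per-cpu pair of counters per allocation site. */
struct alloc_tag_counters {
	u64 bytes;	/* bytes currently allocated at this site */
	u64 calls;	/* live allocations from this site */
};

struct alloc_tag {
	struct codetag ct;
	struct alloc_tag_counters __percpu *counters;
};

/*
 * Hypothetical memcg-aware variant: the counters gain a memcg
 * dimension, so each site pays per-cpu times per-memcg storage even
 * though most (site, memcg) pairs would stay at zero.
 */
struct alloc_tag_memcg {
	struct codetag ct;
	struct alloc_tag_counters __percpu *counters[NR_MEMCG_IDS];
};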
On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote: > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > can have a large cgroup workload running which dominates the machine. > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > information for the whole server as well. This is reason behind this > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > information for people? I am happy with any suggestions! > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > and there is no need for cover letter. > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > for the users in general? You mentioned memory allocation information, > > > > can you please elaborate a bit more. > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > provided by show_free_pages and memory allocation profiling info can help > > > us debug cgoom by comparing them with historical data. What is your take on > > > this? > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > historical data diff will be useful for a memcg oom. If you have a > > concrete example, please give one. For memory allocation profiling, is > > it possible to filter for the given memcg? Do we save memcg information > > in the memory allocation profiling? > > Actually I was thinking about making memory profiling memcg-aware but > it would be quite costly both from memory and performance points of > view. Currently we have a per-cpu counter for each allocation in the > kernel codebase. To make it work for each memcg we would have to add > memcg dimension to the counters, so each counter becomes per-cpu plus > per-memcg. I'll be thinking about possible optimizations since many of > these counters will stay at 0 but any such optimization would come at > a performance cost, which we tried to keep at the absolute minimum. > > I'm CC'ing Sourav and Pasha since they were also interested in making > memory allocation profiling memcg-aware. Would Meta folks (Usama, > Shakeel, Johannes) be interested in such enhancement as well? Would it > be preferable to have such accounting for a specific memcg which we > pre-select (less memory and performance overhead) or we need that for > all memcgs as a generic feature? We have some options here but I want > to understand what would be sufficient and add as little overhead as > possible. Thanks Suren, yes, as already mentioned by Usama, Meta will be interested in memcg aware allocation profiling. I would say start simple and as little overhead as possible. More functionality can be added later when the need arises. Maybe the first useful addition is just adding how many allocations for a specific allocation site are memcg charged.
On Wed, Aug 27, 2025 at 2:15 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Tue, Aug 26, 2025 at 07:32:17PM -0700, Suren Baghdasaryan wrote: > > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > information for the whole server as well. This is reason behind this > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > and there is no need for cover letter. > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > for the users in general? You mentioned memory allocation information, > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > provided by show_free_pages and memory allocation profiling info can help > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > this? > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > historical data diff will be useful for a memcg oom. If you have a > > > concrete example, please give one. For memory allocation profiling, is > > > it possible to filter for the given memcg? Do we save memcg information > > > in the memory allocation profiling? > > > > Actually I was thinking about making memory profiling memcg-aware but > > it would be quite costly both from memory and performance points of > > view. Currently we have a per-cpu counter for each allocation in the > > kernel codebase. To make it work for each memcg we would have to add > > memcg dimension to the counters, so each counter becomes per-cpu plus > > per-memcg. I'll be thinking about possible optimizations since many of > > these counters will stay at 0 but any such optimization would come at > > a performance cost, which we tried to keep at the absolute minimum. > > > > I'm CC'ing Sourav and Pasha since they were also interested in making > > memory allocation profiling memcg-aware. Would Meta folks (Usama, > > Shakeel, Johannes) be interested in such enhancement as well? Would it > > be preferable to have such accounting for a specific memcg which we > > pre-select (less memory and performance overhead) or we need that for > > all memcgs as a generic feature? We have some options here but I want > > to understand what would be sufficient and add as little overhead as > > possible. > > Thanks Suren, yes, as already mentioned by Usama, Meta will be > interested in memcg aware allocation profiling. I would say start simple > and as little overhead as possible. More functionality can be added > later when the need arises. 
> Maybe the first useful addition is just adding how many allocations for a specific allocation site are memcg charged. Adding back Sourav, Pasha and Johannes, who got accidentally dropped in the replies. I looked a bit into adding memcg-awareness into memory allocation profiling and it's more complicated than I first thought (as usual). The main complication is that we need to add memcg_id or some other memcg identifier into codetag_ref. That's needed so that we can unaccount the correct memcg when we free an allocation - that's the usual function of the codetag_ref. Now, extending codetag_ref is not a problem by itself, but when we use mem_profiling_compressed mode, we store an index of the codetag instead of codetag_ref in the unused page flag bits. This is a useful optimization that avoids using page_ext and the overhead associated with it. So, full-blown memcg support seems problematic. What I think is easily doable is a filtering interface where we could select a specific memcg to be profiled, IOW we profile only allocations from a chosen memcg. Filtering can be done using an ioctl interface on /proc/allocinfo, which can be used for other things as well, like filtering non-zero allocations, returning per-NUMA-node information, etc. I see that DAMON uses similar memcg filtering (see damos_filter.memcg_id), so I can reuse some of that code for implementing this facility. From a high level, userspace will be able to select one memcg at a time to be profiled. At some later time the profiling information is gathered and another memcg can be selected, or filtering can be reset to profile all allocations from all memcgs. I expect the overhead for this kind of memcg filtering to be quite low. WDYT folks, would this be helpful and cover your usecases?
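Purely as an illustration of the proposed filtering interface, userspace might drive it roughly as below; the ioctl number, its argument type, and the reset-to-zero semantics are all invented placeholders, since none of this exists yet:

#include <fcntl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented placeholder ABI: select one memcg ID to profile;
 * writing 0 resets the filter so all memcgs are profiled. */
#define ALLOCINFO_SET_MEMCG_FILTER _IOW('A', 0x01, __u64)

int main(void)
{
	__u64 memcg_id = 42;	/* hypothetical ID of the memcg of interest */
	int fd = open("/proc/allocinfo", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Profile only allocations charged to this memcg ... */
	if (ioctl(fd, ALLOCINFO_SET_MEMCG_FILTER, &memcg_id))
		perror("ioctl");
	/* ... later, read the per-site counters, then reset the filter. */
	memcg_id = 0;
	ioctl(fd, ALLOCINFO_SET_MEMCG_FILTER, &memcg_id);
	close(fd);
	return 0;
}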
On 27/08/2025 03:32, Suren Baghdasaryan wrote: > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: >>> On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: >>>> On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: >>>>> Right now in the oom_kill_process if the oom is because of the cgroup >>>>> limit, we won't get memory allocation infomation. In some cases, we >>>>> can have a large cgroup workload running which dominates the machine. >>>>> The reason using cgroup is to leave some resource for system. When this >>>>> cgroup is killed, we would also like to have some memory allocation >>>>> information for the whole server as well. This is reason behind this >>>>> mini change. Is it an acceptable thing to do? Will it be too much >>>>> information for people? I am happy with any suggestions! >>>> >>>> For a single patch, it is better to have all the context in the patch >>>> and there is no need for cover letter. >>> >>> Thanks for your suggestion Shakeel! I will change this in the next version. >>> >>>> >>>> What exact information you want on the memcg oom that will be helpful >>>> for the users in general? You mentioned memory allocation information, >>>> can you please elaborate a bit more. >>>> >>> >>> As in my reply to Suren, I was thinking the system-wide memory usage info >>> provided by show_free_pages and memory allocation profiling info can help >>> us debug cgoom by comparing them with historical data. What is your take on >>> this? >>> >> >> I am not really sure about show_free_areas(). More specifically how the >> historical data diff will be useful for a memcg oom. If you have a >> concrete example, please give one. For memory allocation profiling, is >> it possible to filter for the given memcg? Do we save memcg information >> in the memory allocation profiling? > > Actually I was thinking about making memory profiling memcg-aware but > it would be quite costly both from memory and performance points of > view. Currently we have a per-cpu counter for each allocation in the > kernel codebase. To make it work for each memcg we would have to add > memcg dimension to the counters, so each counter becomes per-cpu plus > per-memcg. I'll be thinking about possible optimizations since many of > these counters will stay at 0 but any such optimization would come at > a performance cost, which we tried to keep at the absolute minimum. > > I'm CC'ing Sourav and Pasha since they were also interested in making > memory allocation profiling memcg-aware. Would Meta folks (Usama, > Shakeel, Johannes) be interested in such enhancement as well? Would it > be preferable to have such accounting for a specific memcg which we > pre-select (less memory and performance overhead) or we need that for > all memcgs as a generic feature? We have some options here but I want > to understand what would be sufficient and add as little overhead as > possible. Yes, having per-memcg counters is going to be extremely useful (we were thinking of having this as a future project to work on). For the Meta fleet in particular, we might have almost 100 memcgs running, but the number of memcgs running workloads is particularly small (usually less than 10). In the rest, you might have services that are responsible for telemetry, monitoring, security, etc. (for which we aren't really interested in the memory allocation profile).
So yes, it would be ideal to have the profile for just pre-selected memcgs, especially if it leads to lower memory and performance overhead. Having the memory allocation profile at memcg level is especially needed when we have multiple workloads stacked on the same host. Having it at host level in such a case makes the data less useful when we have OOMs and for workload analysis, as you don't know which workload is contributing how much. > Thanks, > Suren.
On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > can have a large cgroup workload running which dominates the machine. > > > > The reason using cgroup is to leave some resource for system. When this > > > > cgroup is killed, we would also like to have some memory allocation > > > > information for the whole server as well. This is reason behind this > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > information for people? I am happy with any suggestions! > > > > > > For a single patch, it is better to have all the context in the patch > > > and there is no need for cover letter. > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > for the users in general? You mentioned memory allocation information, > > > can you please elaborate a bit more. > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > provided by show_free_pages and memory allocation profiling info can help > > us debug cgoom by comparing them with historical data. What is your take on > > this? > > > > I am not really sure about show_free_areas(). More specifically how the > historical data diff will be useful for a memcg oom. If you have a > concrete example, please give one. For memory allocation profiling, is Sorry for my late reply. I have been trying hard to think about a use case. One specific case I can think about is when there is no workload stacking, i.e. when one job is running solely on the machine. For example, memory allocation profiling can tell us the memory usage of the network driver, which can make it harder for the cgroup to allocate memory and eventually lead to cgoom. Without this information, it would be hard to reason about what is happening in the kernel given an increased number of ooms. show_free_areas() will give a summary of the different types of memory which can possibly lead to increased cgoom in my previous case. Then one looks deeper via memory allocation profiling as an entry point to debug. Does this make sense to you? > it possible to filter for the given memcg? Do we save memcg information > in the memory allocation profiling? Thanks, Pan
On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > can have a large cgroup workload running which dominates the machine. > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > information for the whole server as well. This is reason behind this > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > information for people? I am happy with any suggestions! > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > and there is no need for cover letter. > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > for the users in general? You mentioned memory allocation information, > > > > can you please elaborate a bit more. > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > provided by show_free_pages and memory allocation profiling info can help > > > us debug cgoom by comparing them with historical data. What is your take on > > > this? > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > historical data diff will be useful for a memcg oom. If you have a > > concrete example, please give one. For memory allocation profiling, is > > Sorry for my late reply. I have been trying hard to think about a use case. > One specific case I can think about is when there is no workload stacking, > when one job is running solely on the machine. For example, memory allocation > profiling can tell the memory usage of the network driver, which can make > cg allocates memory harder and eventually leads to cgoom. Without this > information, it would be hard to reason about what is happening in the kernel > given increased oom number. > > show_free_areas() will give a summary of different types of memory which > can possibably lead to increased cgoom in my previous case. Then one looks > deeper via the memory allocation profiling as an entrypoint to debug. > > Does this make sense to you? I think if we had per-memcg memory profiling that would make sense. Counters would reflect only allocations made by the processes from that memcg and you could easily identify the allocation that caused memcg to oom. But dumping system-wide profiling information at memcg-oom time I think would not help you with this task. It will be polluted with allocations from other memcgs, so likely won't help much (unless there is some obvious leak or you know that a specific allocation is done only by a process from your memcg and no other process). > > > it possible to filter for the given memcg? Do we save memcg information > > in the memory allocation profiling? > > Thanks > Pan
On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote: > On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > information for the whole server as well. This is reason behind this > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > and there is no need for cover letter. > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > for the users in general? You mentioned memory allocation information, > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > provided by show_free_pages and memory allocation profiling info can help > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > this? > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > historical data diff will be useful for a memcg oom. If you have a > > > concrete example, please give one. For memory allocation profiling, is > > > > Sorry for my late reply. I have been trying hard to think about a use case. > > One specific case I can think about is when there is no workload stacking, > > when one job is running solely on the machine. For example, memory allocation > > profiling can tell the memory usage of the network driver, which can make > > cg allocates memory harder and eventually leads to cgoom. Without this > > information, it would be hard to reason about what is happening in the kernel > > given increased oom number. > > > > show_free_areas() will give a summary of different types of memory which > > can possibably lead to increased cgoom in my previous case. Then one looks > > deeper via the memory allocation profiling as an entrypoint to debug. > > > > Does this make sense to you? > > I think if we had per-memcg memory profiling that would make sense. > Counters would reflect only allocations made by the processes from > that memcg and you could easily identify the allocation that caused > memcg to oom. But dumping system-wide profiling information at > memcg-oom time I think would not help you with this task. It will be > polluted with allocations from other memcgs, so likely won't help much > (unless there is some obvious leak or you know that a specific > allocation is done only by a process from your memcg and no other > process). I agree with Suren. It makes very little sense and in many cases it could be actively misleading to print global memory state on memcg OOMs. 
Not to mention that those events, unlike global OOMs, could happen much more often. If you are interested in more information on memcg oom occurrences, you can detect OOM events and print whatever information you need. -- Michal Hocko SUSE Labs
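Michal's suggestion can already be acted on with existing cgroup v2 interfaces: memory.events generates a file-modified notification whenever a counter such as oom_kill increments, so a userspace monitor can watch it and dump whatever state it needs. A rough sketch (the cgroup path is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <sys/inotify.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/workload/memory.events";
	char buf[4096];
	int ifd = inotify_init();

	if (ifd < 0 || inotify_add_watch(ifd, path, IN_MODIFY) < 0) {
		perror("inotify");
		return 1;
	}
	for (;;) {
		/* blocks until memory.events changes, e.g. an oom_kill bump */
		if (read(ifd, buf, sizeof(buf)) < 0)
			break;
		int fd = open(path, O_RDONLY);
		ssize_t n = fd < 0 ? -1 : read(fd, buf, sizeof(buf) - 1);

		if (n > 0) {
			buf[n] = '\0';
			/* dump whatever you need here: these counters,
			 * /proc/allocinfo, /proc/meminfo, ... */
			fputs(buf, stdout);
		}
		if (fd >= 0)
			close(fd);
	}
	close(ifd);
	return 0;
}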
On Fri, Aug 29, 2025 at 08:35:08AM +0200, Michal Hocko wrote: > On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote: > > On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > > > > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > > information for the whole server as well. This is reason behind this > > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > > and there is no need for cover letter. > > > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > > for the users in general? You mentioned memory allocation information, > > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > > provided by show_free_pages and memory allocation profiling info can help > > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > > this? > > > > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > > historical data diff will be useful for a memcg oom. If you have a > > > > concrete example, please give one. For memory allocation profiling, is > > > > > > Sorry for my late reply. I have been trying hard to think about a use case. > > > One specific case I can think about is when there is no workload stacking, > > > when one job is running solely on the machine. For example, memory allocation > > > profiling can tell the memory usage of the network driver, which can make > > > cg allocates memory harder and eventually leads to cgoom. Without this > > > information, it would be hard to reason about what is happening in the kernel > > > given increased oom number. > > > > > > show_free_areas() will give a summary of different types of memory which > > > can possibably lead to increased cgoom in my previous case. Then one looks > > > deeper via the memory allocation profiling as an entrypoint to debug. > > > > > > Does this make sense to you? > > > > I think if we had per-memcg memory profiling that would make sense. > > Counters would reflect only allocations made by the processes from > > that memcg and you could easily identify the allocation that caused > > memcg to oom. But dumping system-wide profiling information at > > memcg-oom time I think would not help you with this task. It will be > > polluted with allocations from other memcgs, so likely won't help much > > (unless there is some obvious leak or you know that a specific > > allocation is done only by a process from your memcg and no other > > process). 
> > I agree with Suren. It makes very little sense and in many cases it > could be actively misleading to print global memory state on memcg OOMs. > Not to mention that those events, unlike global OOMs, could happen much > more often. > If you are interested in more information on memcg oom occurrences, you > can detect OOM events and print whatever information you need. "Misleading" is a concern; the show_mem report would want to print very explicitly which information is specifically for the memcg and which is global, and we don't do that now. I don't think that means we shouldn't print it at all, though, because it can happen that we're in an OOM because one specific codepath is allocating way more memory than it should be; even if the memory allocation profiling info isn't correct for the memcg, it'll be useful information in a situation like that, it just needs to very clearly state what it's reporting on. I'm not sure we do that very well at all now; I'm looking at __show_mem() and it's not even passed a memcg. !? Also, if anyone's thinking about "what if memory allocation profiling was memcg aware" - the thing we saw when doing performance testing is that memcg accounting was much higher overhead than memory allocation profiling - hence, most kernel memory allocations don't even get memcg accounting. I think that got the memcg people looking at ways to make the accounting cheaper, but I'm not sure if anything landed from that.
On Mon 08-09-25 13:34:17, Kent Overstreet wrote: [...] > I'm not sure we do that very well at all now; I'm looking at > __show_mem() and it's not even passed a memcg. !? __show_mem is not called from the memcg oom path. There is mem_cgroup_print_oom_meminfo for that purpose, as memcg stats are sufficiently different to have their own thing. Now, what mem_cgroup_print_oom_meminfo prints is missing a more detailed slab memory usage breakdown. This might be useful in some situations. If we had memcg-aware memory profiling then it could be added to that path as well. -- Michal Hocko SUSE Labs
On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Fri, Aug 29, 2025 at 08:35:08AM +0200, Michal Hocko wrote: > > On Tue 26-08-25 19:38:03, Suren Baghdasaryan wrote: > > > On Tue, Aug 26, 2025 at 7:06 AM Yueyang Pan <pyyjason@gmail.com> wrote: > > > > > > > > On Thu, Aug 21, 2025 at 12:53:03PM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > > > information for the whole server as well. This is reason behind this > > > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > > > and there is no need for cover letter. > > > > > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > > > for the users in general? You mentioned memory allocation information, > > > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > > > provided by show_free_pages and memory allocation profiling info can help > > > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > > > this? > > > > > > > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > > > historical data diff will be useful for a memcg oom. If you have a > > > > > concrete example, please give one. For memory allocation profiling, is > > > > > > > > Sorry for my late reply. I have been trying hard to think about a use case. > > > > One specific case I can think about is when there is no workload stacking, > > > > when one job is running solely on the machine. For example, memory allocation > > > > profiling can tell the memory usage of the network driver, which can make > > > > cg allocates memory harder and eventually leads to cgoom. Without this > > > > information, it would be hard to reason about what is happening in the kernel > > > > given increased oom number. > > > > > > > > show_free_areas() will give a summary of different types of memory which > > > > can possibably lead to increased cgoom in my previous case. Then one looks > > > > deeper via the memory allocation profiling as an entrypoint to debug. > > > > > > > > Does this make sense to you? > > > > > > I think if we had per-memcg memory profiling that would make sense. > > > Counters would reflect only allocations made by the processes from > > > that memcg and you could easily identify the allocation that caused > > > memcg to oom. But dumping system-wide profiling information at > > > memcg-oom time I think would not help you with this task. 
> > > It will be polluted with allocations from other memcgs, so likely won't help much > > > (unless there is some obvious leak or you know that a specific > > > allocation is done only by a process from your memcg and no other > > > process). > > > > I agree with Suren. It makes very little sense and in many cases it > > could be actively misleading to print global memory state on memcg OOMs. > > Not to mention that those events, unlike global OOMs, could happen much > > more often. > > If you are interested in more information on memcg oom occurrences, you > > can detect OOM events and print whatever information you need. > > "Misleading" is a concern; the show_mem report would want to print very > explicitly which information is specifically for the memcg and which is > global, and we don't do that now. > > I don't think that means we shouldn't print it at all though, because it > can happen that we're in an OOM because one specific codepath is > allocating way more memory than we should be; even if the memory > allocation profiling info isn't correct for the memcg it'll be useful > information in a situation like that, it just needs to very clearly > state what it's reporting on. > > I'm not sure we do that very well at all now, I'm looking at > __show_mem() and it's not even passed a memcg. !? > > Also, if anyone's thinking about "what if memory allocation profiling > was memcg aware" - the thing we saw when doing performance testing is > that memcg accounting was much higher overhead than memory allocation > profiling - hence, most kernel memory allocations don't even get memcg > accounting. > > I think that got the memcg people looking at ways to make the accounting > cheaper, but I'm not sure if anything landed from that. Yes, Roman landed a series of changes reducing the memcg accounting overhead.
On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > I think that got the memcg people looking at ways to make the accounting > > cheaper, but I'm not sure if anything landed from that. > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. Do you know offhand how big that was?
On Mon, Sep 8, 2025 at 10:49 AM Kent Overstreet <kent.overstreet@linux.dev> wrote: > > On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > I think that got the memcg people looking at ways to make the accounting > > > cheaper, but I'm not sure if anything landed from that. > > > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. > > Do you know offhand how big that was? I'll need to dig it up but it was still much higher than memory profiling.
On Mon, Sep 08, 2025 at 10:51:30AM -0700, Suren Baghdasaryan wrote: > On Mon, Sep 8, 2025 at 10:49 AM Kent Overstreet > <kent.overstreet@linux.dev> wrote: > > > > On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > > > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > I think that got the memcg people looking at ways to make the accounting > > > > cheaper, but I'm not sure if anything landed from that. > > > > > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. > > > > Do you know offhand how big that was? > > I'll need to dig it up but it was still much higher than memory profiling. What benchmark/workload was used to compare memcg accounting and memory profiling?
On Mon, Sep 8, 2025 at 12:08 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Sep 08, 2025 at 10:51:30AM -0700, Suren Baghdasaryan wrote: > > On Mon, Sep 8, 2025 at 10:49 AM Kent Overstreet > > <kent.overstreet@linux.dev> wrote: > > > > > > On Mon, Sep 08, 2025 at 10:47:06AM -0700, Suren Baghdasaryan wrote: > > > > On Mon, Sep 8, 2025 at 10:34 AM Kent Overstreet > > > > <kent.overstreet@linux.dev> wrote: > > > > > > > > > > I think that got the memcg people looking at ways to make the accounting > > > > > cheaper, but I'm not sure if anything landed from that. > > > > > > > > Yes, Roman landed a series of changes reducing the memcg accounting overhead. > > > > > > Do you know offhand how big that was? > > > > I'll need to dig it up but it was still much higher than memory profiling. > > What benchmark/workload was used to compare memcg accounting and memory > profiling? It was an in-kernel allocation stress test. Not very realistic but good for comparing the overhead.
On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > can have a large cgroup workload running which dominates the machine. > > > > The reason using cgroup is to leave some resource for system. When this > > > > cgroup is killed, we would also like to have some memory allocation > > > > information for the whole server as well. This is reason behind this > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > information for people? I am happy with any suggestions! > > > > > > For a single patch, it is better to have all the context in the patch > > > and there is no need for cover letter. > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > for the users in general? You mentioned memory allocation information, > > > can you please elaborate a bit more. > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > provided by show_free_pages and memory allocation profiling info can help > > us debug cgoom by comparing them with historical data. What is your take on > > this? > > > > I am not really sure about show_free_areas(). More specifically how the > historical data diff will be useful for a memcg oom. If you have a > concrete example, please give one. For memory allocation profiling, is > it possible to filter for the given memcg? Do we save memcg information > in the memory allocation profiling? No, memory allocation profiling is not cgroup-aware. It tracks allocations and their code locations but no other context.
On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote: > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > can have a large cgroup workload running which dominates the machine. > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > information for the whole server as well. This is reason behind this > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > information for people? I am happy with any suggestions! > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > and there is no need for cover letter. > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > for the users in general? You mentioned memory allocation information, > > > > can you please elaborate a bit more. > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > provided by show_free_pages and memory allocation profiling info can help > > > us debug cgoom by comparing them with historical data. What is your take on > > > this? > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > historical data diff will be useful for a memcg oom. If you have a > > concrete example, please give one. For memory allocation profiling, is > > it possible to filter for the given memcg? Do we save memcg information > > in the memory allocation profiling? > > No, memory allocation profiling is not cgroup-aware. It tracks > allocations and their code locations but no other context. Thanks for the info. Pan, will having memcg info along with allocation profile help your use-case? (Though adding that might not be easy or cheaper)
On Thu, Aug 21, 2025 at 02:26:42PM -0700, Shakeel Butt wrote: > On Thu, Aug 21, 2025 at 01:00:36PM -0700, Suren Baghdasaryan wrote: > > On Thu, Aug 21, 2025 at 12:53 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Thu, Aug 21, 2025 at 12:18:00PM -0700, Yueyang Pan wrote: > > > > On Thu, Aug 21, 2025 at 11:35:19AM -0700, Shakeel Butt wrote: > > > > > On Thu, Aug 14, 2025 at 10:11:56AM -0700, Yueyang Pan wrote: > > > > > > Right now in the oom_kill_process if the oom is because of the cgroup > > > > > > limit, we won't get memory allocation infomation. In some cases, we > > > > > > can have a large cgroup workload running which dominates the machine. > > > > > > The reason using cgroup is to leave some resource for system. When this > > > > > > cgroup is killed, we would also like to have some memory allocation > > > > > > information for the whole server as well. This is reason behind this > > > > > > mini change. Is it an acceptable thing to do? Will it be too much > > > > > > information for people? I am happy with any suggestions! > > > > > > > > > > For a single patch, it is better to have all the context in the patch > > > > > and there is no need for cover letter. > > > > > > > > Thanks for your suggestion Shakeel! I will change this in the next version. > > > > > > > > > > > > > > What exact information you want on the memcg oom that will be helpful > > > > > for the users in general? You mentioned memory allocation information, > > > > > can you please elaborate a bit more. > > > > > > > > > > > > > As in my reply to Suren, I was thinking the system-wide memory usage info > > > > provided by show_free_pages and memory allocation profiling info can help > > > > us debug cgoom by comparing them with historical data. What is your take on > > > > this? > > > > > > > > > > I am not really sure about show_free_areas(). More specifically how the > > > historical data diff will be useful for a memcg oom. If you have a > > > concrete example, please give one. For memory allocation profiling, is > > > it possible to filter for the given memcg? Do we save memcg information > > > in the memory allocation profiling? > > > > No, memory allocation profiling is not cgroup-aware. It tracks > > allocations and their code locations but no other context. > > Thanks for the info. Pan, will having memcg info along with allocation > profile help your use-case? (Though adding that might not be easy or > cheaper) Yeah, I have been thinking about it with eBPF hooks, but it is going to be a long-term effort as we need to measure the overhead. Right now, the way memory profiling is implemented incurs almost "zero" overhead.