From: xu xin <xu.xin16@zte.com.cn>

v2->v3:
------
Some fixes for compilation errors due to a missed header inclusion or a
missed function definition on some kernel configs.
https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/

v1->v2:
------
According to Shakeel's suggestion, expose these metric items in the existing
memory.stat instead of adding a new interface.
https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/

Background
==========

With the enablement of container-level KSM (e.g., via prctl [1]), there is
a growing demand for container-level observability of KSM behavior. However,
current cgroup implementations lack support for exposing KSM-related metrics.

So add the counters to the existing memory.stat without adding a new
interface. To display per-memcg KSM statistic counters, we traverse all
processes of a memcg and sum up their per-process KSM counters, instead of
adding enum items to memcg_stat_item or node_stat_item and updating the
corresponding counters whenever ksmd manipulates pages. (A minimal sketch of
this traversal is appended at the end of this mail.)

Now Linux users can look up all per-memcg KSM counters by:

  # cat /sys/fs/cgroup/xuxin/memory.stat | grep ksm
  ksm_rmap_items 0
  ksm_zero_pages 0
  ksm_merging_pages 0
  ksm_profit 0

Q&A
====
Why not add enum items to memcg_stat_item or node_stat_item like the other
items in memory.stat?

I tried adding enum items to memcg_stat_item and updating them whenever ksmd
manipulates pages, but the resulting per-memcg KSM counters were wrong, for
the following reasons:

1) The KSM counters of a memcg can be correctly incremented, but cannot be
   properly decremented. E.g., when ksmd scans pages of a process, it can use
   the mm_struct of the struct ksm_rmap_item to reverse-lookup the memcg and
   then increase the value via mod_memcg_state(memcg, MEMCG_KSM_COUNT, 1).
   However, when the process exits abruptly, since ksmd asynchronously scans
   the mm_slot list in the background, it is no longer able to locate the
   original memcg from the mm_struct via get_mem_cgroup_from_mm(), as the
   task_struct has already been freed.

2) The first issue could potentially be addressed by adding a memcg pointer
   directly to the ksm_rmap_item structure. However, this increases memory
   overhead, especially when there are a large number of ksm_rmap_items in
   the system (due to a high volume of pages being scanned by ksmd).
   Moreover, this approach does not resolve the same problem for
   ksm_zero_pages, because updates to ksm_zero_pages are not performed
   through ksm_rmap_item, but rather directly during unmap or page table
   entry (pte) faults based on the mm_struct. At that point, if the process
   has already exited, the corresponding memcg can no longer be accurately
   identified.

xu xin (6):
  memcg: add per-memcg ksm_rmap_items stat
  memcg: show ksm_zero_pages count in memory.stat
  memcg: show ksm_merging_pages in memory.stat
  ksm: make ksm_process_profit available on CONFIG_PROC_FS=n
  memcg: add per-memcg ksm_profit
  Documentation: add KSM statistic counters description in cgroup-v2.rst

 Documentation/admin-guide/cgroup-v2.rst | 17 ++++++
 include/linux/ksm.h                     |  1 +
 mm/ksm.c                                | 70 ++++++++++++++++++++++---
 mm/memcontrol.c                         |  5 ++
 4 files changed, 85 insertions(+), 8 deletions(-)

--
2.25.1
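
For reference, a minimal sketch of the traversal described above, in kernel
style. This is not the actual patch: the helper name memcg_sum_ksm_rmap_items
is made up here, and it only assumes the existing per-mm counter
mm->ksm_rmap_items and the generic css task iterator; locking and
memcg-hierarchy handling are omitted.

  #include <linux/cgroup.h>
  #include <linux/memcontrol.h>
  #include <linux/mm_types.h>
  #include <linux/sched/mm.h>

  /* Hypothetical helper: sum one per-mm KSM counter over a memcg's tasks. */
  static unsigned long memcg_sum_ksm_rmap_items(struct mem_cgroup *memcg)
  {
  	struct css_task_iter it;
  	struct task_struct *task;
  	unsigned long sum = 0;

  	css_task_iter_start(&memcg->css, CSS_TASK_ITER_PROCS, &it);
  	while ((task = css_task_iter_next(&it))) {
  		struct mm_struct *mm = get_task_mm(task);

  		if (!mm)
  			continue;
  		sum += mm->ksm_rmap_items;	/* per-mm counter, CONFIG_KSM */
  		mmput(mm);
  	}
  	css_task_iter_end(&it);

  	return sum;
  }

Iterating with CSS_TASK_ITER_PROCS visits only thread-group leaders, so a
shared mm is not counted once per thread.
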
Hi Xu,

On Sun, Sep 21, 2025 at 11:07:26PM +0800, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
>
> v2->v3:
> ------
> Some fixes for compilation errors due to a missed header inclusion or a
> missed function definition on some kernel configs.
> https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
>
> v1->v2:
> ------
> According to Shakeel's suggestion, expose these metric items in the existing
> memory.stat instead of adding a new interface.
> https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
>
> Background
> ==========
>
> With the enablement of container-level KSM (e.g., via prctl [1]), there is
> a growing demand for container-level observability of KSM behavior. However,
> current cgroup implementations lack support for exposing KSM-related metrics.
>
> So add the counters to the existing memory.stat without adding a new
> interface. To display per-memcg KSM statistic counters, we traverse all
> processes of a memcg and sum up their per-process KSM counters, instead of
> adding enum items to memcg_stat_item or node_stat_item and updating the
> corresponding counters whenever ksmd manipulates pages.
>
> Now Linux users can look up all per-memcg KSM counters by:
>
>   # cat /sys/fs/cgroup/xuxin/memory.stat | grep ksm
>   ksm_rmap_items 0
>   ksm_zero_pages 0
>   ksm_merging_pages 0
>   ksm_profit 0
>
> Q&A
> ====
> Why not add enum items to memcg_stat_item or node_stat_item like the other
> items in memory.stat?
>
> I tried adding enum items to memcg_stat_item and updating them whenever ksmd
> manipulates pages, but the resulting per-memcg KSM counters were wrong, for
> the following reasons:
>
> 1) The KSM counters of a memcg can be correctly incremented, but cannot be
>    properly decremented. E.g., when ksmd scans pages of a process, it can
>    use the mm_struct of the struct ksm_rmap_item to reverse-lookup the memcg
>    and then increase the value via mod_memcg_state(memcg, MEMCG_KSM_COUNT, 1).
>    However, when the process exits abruptly, since ksmd asynchronously scans
>    the mm_slot list in the background, it is no longer able to locate the
>    original memcg from the mm_struct via get_mem_cgroup_from_mm(), as the
>    task_struct has already been freed.
>
> 2) The first issue could potentially be addressed by adding a memcg pointer
>    directly to the ksm_rmap_item structure. However, this increases memory
>    overhead, especially when there are a large number of ksm_rmap_items in
>    the system (due to a high volume of pages being scanned by ksmd).
>    Moreover, this approach does not resolve the same problem for
>    ksm_zero_pages, because updates to ksm_zero_pages are not performed
>    through ksm_rmap_item, but rather directly during unmap or page table
>    entry (pte) faults based on the mm_struct. At that point, if the process
>    has already exited, the corresponding memcg can no longer be accurately
>    identified.
>

Thanks for writing this up, and sorry to disappoint you, but this explanation
is giving me more reasons that memcg is not the right place for these stats.
If you take a look at the memcg stats exposed through memory.stat, there are
generally two types. First are the ones that describe the type or property of
the underlying memory, where that memory is associated with or charged to the
memcg, e.g. anon, file or kernel (and other types of) memory. Please note
that this memory's lifetime can be independent of the process that might have
allocated it.

Second are the events experienced by the processes in that memcg, like page
faults, reclaim, etc.

The KSM stats are about the process, not about the memcg of the process. A
process jumping from one memcg to another will take all these stats with it.

You can easily get KSM stats in userspace by traversing /proc/<pid>/ksm_stat
with the pids from cgroup.procs. You are just looking for an easier way to
get such stats than manual traversal. I would suggest exploring a
cgroup-iter-based BPF program which can do the stats collection and expose it
to userspace for a given cgroup hierarchy.
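
For illustration, a rough userspace sketch of that manual traversal, assuming
cgroup v2 and the per-process /proc/<pid>/ksm_stat file; the cgroup path and
the counter name are placeholders, and processes that exit mid-walk are
simply skipped.

  #include <stdio.h>
  #include <string.h>

  /* Sum one KSM counter over all processes listed in a cgroup's cgroup.procs. */
  static long sum_ksm_counter(const char *cgroup_path, const char *counter)
  {
  	char path[512], key[64];
  	long val, sum = 0;
  	FILE *procs, *f;
  	int pid;

  	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_path);
  	procs = fopen(path, "r");
  	if (!procs)
  		return -1;

  	while (fscanf(procs, "%d", &pid) == 1) {
  		snprintf(path, sizeof(path), "/proc/%d/ksm_stat", pid);
  		f = fopen(path, "r");
  		if (!f)
  			continue;	/* process may have exited */
  		while (fscanf(f, "%63s %ld", key, &val) == 2) {
  			if (!strcmp(key, counter))
  				sum += val;
  		}
  		fclose(f);
  	}
  	fclose(procs);

  	return sum;
  }

  int main(void)
  {
  	long n = sum_ksm_counter("/sys/fs/cgroup/xuxin", "ksm_rmap_items");

  	printf("ksm_rmap_items %ld\n", n);
  	return 0;
  }
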
On 21.09.25 17:07, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
>
> v2->v3:
> ------
> Some fixes for compilation errors due to a missed header inclusion or a
> missed function definition on some kernel configs.
> https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
>
> v1->v2:
> ------
> According to Shakeel's suggestion, expose these metric items in the existing
> memory.stat instead of adding a new interface.
> https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
>
> Background
> ==========
>
> With the enablement of container-level KSM (e.g., via prctl [1]), there is
> a growing demand for container-level observability of KSM behavior. However,
> current cgroup implementations lack support for exposing KSM-related metrics.
>
> So add the counters to the existing memory.stat without adding a new
> interface. To display per-memcg KSM statistic counters, we traverse all
> processes of a memcg and sum up their per-process KSM counters, instead of
> adding enum items to memcg_stat_item or node_stat_item and updating the
> corresponding counters whenever ksmd manipulates pages.
>
> Now Linux users can look up all per-memcg KSM counters by:
>
>   # cat /sys/fs/cgroup/xuxin/memory.stat | grep ksm
>   ksm_rmap_items 0
>   ksm_zero_pages 0
>   ksm_merging_pages 0
>   ksm_profit 0

No strong opinion from my side: it seems to mostly just collect the stats
from all tasks and summarize them per memcg.

--
Cheers

David / dhildenb
On Sun 21-09-25 23:07:26, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
>
> v2->v3:
> ------
> Some fixes for compilation errors due to a missed header inclusion or a
> missed function definition on some kernel configs.
> https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
>
> v1->v2:
> ------
> According to Shakeel's suggestion, expose these metric items in the existing
> memory.stat instead of adding a new interface.
> https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
>
> Background
> ==========
>
> With the enablement of container-level KSM (e.g., via prctl [1]), there is
> a growing demand for container-level observability of KSM behavior. However,
> current cgroup implementations lack support for exposing KSM-related metrics.

Could you be more specific about why this is needed and what it will be used
for?
--
Michal Hocko
SUSE Labs
> > From: xu xin <xu.xin16@zte.com.cn>
> >
> > v2->v3:
> > ------
> > Some fixes for compilation errors due to a missed header inclusion or a
> > missed function definition on some kernel configs.
> > https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> > https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
> >
> > v1->v2:
> > ------
> > According to Shakeel's suggestion, expose these metric items in the
> > existing memory.stat instead of adding a new interface.
> > https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
> >
> > Background
> > ==========
> >
> > With the enablement of container-level KSM (e.g., via prctl [1]), there is
> > a growing demand for container-level observability of KSM behavior. However,
> > current cgroup implementations lack support for exposing KSM-related metrics.
>
> Could you be more specific about why this is needed and what it will be used
> for?

Yes. Some Linux application developers and vendors are eager to deploy the
container-level KSM feature in containers (docker, containerd, runc and so
on). They have found significant memory savings without needing to modify
application source code as before, for example by adding a prctl call to
enable KSM in the container's startup program (see the small example appended
below). Processes within the container then inherit the KSM attribute via
fork, allowing the entire container to have KSM enabled.

However, in practice, not all containers benefit from KSM's memory savings.
Some containers may have few identical pages but incur additional memory
overhead due to excessive ksm_rmap_item generation from KSM scanning.
Therefore, we need to provide a container-level KSM monitoring method,
enabling users to adjust their strategies based on actual KSM merging
performance.

> --
> Michal Hocko
> SUSE Labs
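
For context, the prctl-based enablement mentioned above is roughly the
following sketch. PR_SET_MEMORY_MERGE is the documented prctl option; the
fallback define and the placement in a container entrypoint are illustrative,
and the call needs CONFIG_KSM in the kernel plus CAP_SYS_RESOURCE.

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_MEMORY_MERGE
  #define PR_SET_MEMORY_MERGE 67
  #endif

  int main(void)
  {
  	/*
  	 * Opt the whole process into KSM; children forked afterwards
  	 * (i.e. everything the container entrypoint starts) inherit
  	 * the attribute, as described above.
  	 */
  	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0)) {
  		perror("PR_SET_MEMORY_MERGE");
  		return 1;
  	}

  	/* ... exec the real container workload here ... */
  	return 0;
  }
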
On Mon 22-09-25 17:31:58, xu.xin16@zte.com.cn wrote:
> > > From: xu xin <xu.xin16@zte.com.cn>
> > >
> > > v2->v3:
> > > ------
> > > Some fixes for compilation errors due to a missed header inclusion or a
> > > missed function definition on some kernel configs.
> > > https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> > > https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
> > >
> > > v1->v2:
> > > ------
> > > According to Shakeel's suggestion, expose these metric items in the
> > > existing memory.stat instead of adding a new interface.
> > > https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
> > >
> > > Background
> > > ==========
> > >
> > > With the enablement of container-level KSM (e.g., via prctl [1]), there
> > > is a growing demand for container-level observability of KSM behavior.
> > > However, current cgroup implementations lack support for exposing
> > > KSM-related metrics.
> >
> > Could you be more specific about why this is needed and what it will be
> > used for?
>
> Yes. Some Linux application developers and vendors are eager to deploy the
> container-level KSM feature in containers (docker, containerd, runc and so
> on). They have found significant memory savings without needing to modify
> application source code as before, for example by adding a prctl call to
> enable KSM in the container's startup program. Processes within the
> container then inherit the KSM attribute via fork, allowing the entire
> container to have KSM enabled.
>
> However, in practice, not all containers benefit from KSM's memory savings.
> Some containers may have few identical pages but incur additional memory
> overhead due to excessive ksm_rmap_item generation from KSM scanning.
> Therefore, we need to provide a container-level KSM monitoring method,
> enabling users to adjust their strategies based on actual KSM merging
> performance.

So what is the strategy here? You watch the runtime behavior and then disable
KSM based on the previous run? I do not think this can be changed during
runtime, right? So it would only work for the next run, and that would rely
on the workload being consistent across re-runs, right?

I am not really convinced, TBH, but not so much as to NAK this. What concerns
me a bit is that these per-memcg stats are slightly different from the global
ones without a very good explanation (or maybe I have just not understood it
properly). Also, the use case sounds a bit shaky, as it doesn't really give
admins great control beyond a hope that a new execution of the container will
behave consistently with previous runs. I thought the whole concept of
per-process KSM is based on "we know our userspace benefits" rather than
"let's try and see".

All in all, I worry this will end up not really being used and we will have
yet another set of counters to maintain without real users.
--
Michal Hocko
SUSE Labs