From: xu xin <xu.xin16@zte.com.cn>

v2->v3:
------
Some fixes for compilation errors due to a missed header inclusion or a
missed function definition on some kernel configs.
https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/

v1->v2:
------
According to Shakeel's suggestion, expose these metric items in the existing
memory.stat instead of adding a new interface.
https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/

Background
==========

With the enablement of container-level KSM (e.g., via prctl [1]), there is
a growing demand for container-level observability of KSM behavior. However,
current cgroup implementations lack support for exposing KSM-related metrics.

So add the counters to the existing memory.stat without adding a new
interface. To display per-memcg KSM statistic counters, we traverse all
processes of a memcg and sum up their per-process KSM counters, instead of
adding enum items to memcg_stat_item or node_stat_item and updating the
corresponding counters whenever ksmd manipulates pages. (A minimal sketch of
this traversal is appended at the end of this mail.)

Now Linux users can look up all per-memcg KSM counters by:

  # cat /sys/fs/cgroup/xuxin/memory.stat | grep ksm
  ksm_rmap_items 0
  ksm_zero_pages 0
  ksm_merging_pages 0
  ksm_profit 0

Q&A
====
Why not add enum items to memcg_stat_item or node_stat_item like the other
items in memory.stat?

I tried adding enum items to memcg_stat_item and updating them whenever ksmd
manipulates pages, but the resulting per-memcg KSM counters were wrong, for
the following reasons:

1) The KSM counters of a memcg can be correctly incremented, but cannot be
   properly decremented. E.g., when ksmd scans pages of a process, it can use
   the mm_struct of the struct ksm_rmap_item to reverse-lookup the memcg and
   then increase the value via mod_memcg_state(memcg, MEMCG_KSM_COUNT, 1).
   However, when the process exits abruptly, since ksmd asynchronously scans
   the mm_slot list in the background, it is no longer able to locate the
   original memcg from the mm_struct via get_mem_cgroup_from_mm(), as the
   task_struct has already been freed.

2) The first issue could potentially be addressed by adding a memcg pointer
   directly to the ksm_rmap_item structure. However, this increases memory
   overhead, especially when there are a large number of ksm_rmap_items in
   the system (due to a high volume of pages being scanned by ksmd).
   Moreover, this approach does not resolve the same problem for
   ksm_zero_pages, because updates to ksm_zero_pages are not performed
   through ksm_rmap_item, but rather directly during unmap or page table
   entry (pte) faults based on the mm_struct. At that point, if the process
   has already exited, the corresponding memcg can no longer be accurately
   identified.

xu xin (6):
  memcg: add per-memcg ksm_rmap_items stat
  memcg: show ksm_zero_pages count in memory.stat
  memcg: show ksm_merging_pages in memory.stat
  ksm: make ksm_process_profit available on CONFIG_PROC_FS=n
  memcg: add per-memcg ksm_profit
  Documentation: add KSM statistic counters description in cgroup-v2.rst

 Documentation/admin-guide/cgroup-v2.rst | 17 ++++++
 include/linux/ksm.h                     |  1 +
 mm/ksm.c                                | 70 ++++++++++++++++++++++---
 mm/memcontrol.c                         |  5 ++
 4 files changed, 85 insertions(+), 8 deletions(-)

--
2.25.1
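
For reference, a minimal sketch of the traversal described above, in kernel
style. This is not the actual patch: the helper name memcg_sum_ksm_rmap_items
is made up here, and it only assumes the existing per-mm counter
mm->ksm_rmap_items and the generic css task iterator; locking and
memcg-hierarchy handling are omitted.

  #include <linux/cgroup.h>
  #include <linux/memcontrol.h>
  #include <linux/mm_types.h>
  #include <linux/sched/mm.h>

  /* Hypothetical helper: sum one per-mm KSM counter over a memcg's tasks. */
  static unsigned long memcg_sum_ksm_rmap_items(struct mem_cgroup *memcg)
  {
  	struct css_task_iter it;
  	struct task_struct *task;
  	unsigned long sum = 0;

  	css_task_iter_start(&memcg->css, CSS_TASK_ITER_PROCS, &it);
  	while ((task = css_task_iter_next(&it))) {
  		struct mm_struct *mm = get_task_mm(task);

  		if (!mm)
  			continue;
  		sum += mm->ksm_rmap_items;	/* per-mm counter, CONFIG_KSM */
  		mmput(mm);
  	}
  	css_task_iter_end(&it);

  	return sum;
  }

Iterating with CSS_TASK_ITER_PROCS visits only thread-group leaders, so a
shared mm is not counted once per thread.
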
Hi Xu,

On Sun, Sep 21, 2025 at 11:07:26PM +0800, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
>
> v2->v3:
> ------
> Some fixes for compilation errors due to a missed header inclusion or a
> missed function definition on some kernel configs.
> https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
>
> v1->v2:
> ------
> According to Shakeel's suggestion, expose these metric items in the existing
> memory.stat instead of adding a new interface.
> https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
>
> Background
> ==========
>
> With the enablement of container-level KSM (e.g., via prctl [1]), there is
> a growing demand for container-level observability of KSM behavior. However,
> current cgroup implementations lack support for exposing KSM-related metrics.
>
> So add the counters to the existing memory.stat without adding a new
> interface. To display per-memcg KSM statistic counters, we traverse all
> processes of a memcg and sum up their per-process KSM counters, instead of
> adding enum items to memcg_stat_item or node_stat_item and updating the
> corresponding counters whenever ksmd manipulates pages.
>
> Now Linux users can look up all per-memcg KSM counters by:
>
>   # cat /sys/fs/cgroup/xuxin/memory.stat | grep ksm
>   ksm_rmap_items 0
>   ksm_zero_pages 0
>   ksm_merging_pages 0
>   ksm_profit 0
>
> Q&A
> ====
> Why not add enum items to memcg_stat_item or node_stat_item like the other
> items in memory.stat?
>
> I tried adding enum items to memcg_stat_item and updating them whenever ksmd
> manipulates pages, but the resulting per-memcg KSM counters were wrong, for
> the following reasons:
>
> 1) The KSM counters of a memcg can be correctly incremented, but cannot be
>    properly decremented. E.g., when ksmd scans pages of a process, it can
>    use the mm_struct of the struct ksm_rmap_item to reverse-lookup the memcg
>    and then increase the value via mod_memcg_state(memcg, MEMCG_KSM_COUNT, 1).
>    However, when the process exits abruptly, since ksmd asynchronously scans
>    the mm_slot list in the background, it is no longer able to locate the
>    original memcg from the mm_struct via get_mem_cgroup_from_mm(), as the
>    task_struct has already been freed.
>
> 2) The first issue could potentially be addressed by adding a memcg pointer
>    directly to the ksm_rmap_item structure. However, this increases memory
>    overhead, especially when there are a large number of ksm_rmap_items in
>    the system (due to a high volume of pages being scanned by ksmd).
>    Moreover, this approach does not resolve the same problem for
>    ksm_zero_pages, because updates to ksm_zero_pages are not performed
>    through ksm_rmap_item, but rather directly during unmap or page table
>    entry (pte) faults based on the mm_struct. At that point, if the process
>    has already exited, the corresponding memcg can no longer be accurately
>    identified.
>

Thanks for writing this up, and sorry to disappoint you, but this explanation
is giving me more reasons that memcg is not the right place for these stats.
If you take a look at the memcg stats exposed through memory.stat, there are
generally two types. First are the ones that describe the type or property of
the underlying memory, where that memory is associated with or charged to the
memcg, e.g. anon, file or kernel (and other types of) memory. Please note
that this memory's lifetime can be independent of the process that might have
allocated it.

Second are the events experienced by the processes in that memcg, like page
faults, reclaim, etc.

The KSM stats are about the process, not about the memcg of the process. A
process jumping from one memcg to another will take all these stats with it.

You can easily get KSM stats in userspace by traversing /proc/<pid>/ksm_stat
with the pids from cgroup.procs. You are just looking for an easier way to
get such stats than manual traversal. I would suggest exploring a
cgroup-iter-based BPF program which can do the stats collection and expose it
to userspace for a given cgroup hierarchy.
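
For illustration, a rough userspace sketch of that manual traversal, assuming
cgroup v2 and the per-process /proc/<pid>/ksm_stat file; the cgroup path and
the counter name are placeholders, and processes that exit mid-walk are
simply skipped.

  #include <stdio.h>
  #include <string.h>

  /* Sum one KSM counter over all processes listed in a cgroup's cgroup.procs. */
  static long sum_ksm_counter(const char *cgroup_path, const char *counter)
  {
  	char path[512], key[64];
  	long val, sum = 0;
  	FILE *procs, *f;
  	int pid;

  	snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_path);
  	procs = fopen(path, "r");
  	if (!procs)
  		return -1;

  	while (fscanf(procs, "%d", &pid) == 1) {
  		snprintf(path, sizeof(path), "/proc/%d/ksm_stat", pid);
  		f = fopen(path, "r");
  		if (!f)
  			continue;	/* process may have exited */
  		while (fscanf(f, "%63s %ld", key, &val) == 2) {
  			if (!strcmp(key, counter))
  				sum += val;
  		}
  		fclose(f);
  	}
  	fclose(procs);

  	return sum;
  }

  int main(void)
  {
  	long n = sum_ksm_counter("/sys/fs/cgroup/xuxin", "ksm_rmap_items");

  	printf("ksm_rmap_items %ld\n", n);
  	return 0;
  }
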
On 21.09.25 17:07, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
>
> v2->v3:
> ------
> Some fixes for compilation errors due to a missed header inclusion or a
> missed function definition on some kernel configs.
> https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
>
> v1->v2:
> ------
> According to Shakeel's suggestion, expose these metric items in the existing
> memory.stat instead of adding a new interface.
> https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
>
> Background
> ==========
>
> With the enablement of container-level KSM (e.g., via prctl [1]), there is
> a growing demand for container-level observability of KSM behavior. However,
> current cgroup implementations lack support for exposing KSM-related metrics.
>
> So add the counters to the existing memory.stat without adding a new
> interface. To display per-memcg KSM statistic counters, we traverse all
> processes of a memcg and sum up their per-process KSM counters, instead of
> adding enum items to memcg_stat_item or node_stat_item and updating the
> corresponding counters whenever ksmd manipulates pages.
>
> Now Linux users can look up all per-memcg KSM counters by:
>
>   # cat /sys/fs/cgroup/xuxin/memory.stat | grep ksm
>   ksm_rmap_items 0
>   ksm_zero_pages 0
>   ksm_merging_pages 0
>   ksm_profit 0

No strong opinion from my side: it seems to mostly just collect the stats
from all tasks and summarize them per memcg.

--
Cheers

David / dhildenb
On Sun 21-09-25 23:07:26, xu.xin16@zte.com.cn wrote:
> From: xu xin <xu.xin16@zte.com.cn>
>
> v2->v3:
> ------
> Some fixes for compilation errors due to a missed header inclusion or a
> missed function definition on some kernel configs.
> https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
>
> v1->v2:
> ------
> According to Shakeel's suggestion, expose these metric items in the existing
> memory.stat instead of adding a new interface.
> https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
>
> Background
> ==========
>
> With the enablement of container-level KSM (e.g., via prctl [1]), there is
> a growing demand for container-level observability of KSM behavior. However,
> current cgroup implementations lack support for exposing KSM-related metrics.

Could you be more specific about why this is needed and what it will be used
for?
--
Michal Hocko
SUSE Labs
> > From: xu xin <xu.xin16@zte.com.cn>
> >
> > v2->v3:
> > ------
> > Some fixes for compilation errors due to a missed header inclusion or a
> > missed function definition on some kernel configs.
> > https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> > https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
> >
> > v1->v2:
> > ------
> > According to Shakeel's suggestion, expose these metric items in the
> > existing memory.stat instead of adding a new interface.
> > https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
> >
> > Background
> > ==========
> >
> > With the enablement of container-level KSM (e.g., via prctl [1]), there is
> > a growing demand for container-level observability of KSM behavior. However,
> > current cgroup implementations lack support for exposing KSM-related metrics.
>
> Could you be more specific about why this is needed and what it will be used
> for?

Yes. Some Linux application developers and vendors are eager to deploy the
container-level KSM feature in containers (docker, containerd, runc and so
on). They have found significant memory savings without needing to modify
application source code as before, for example by adding a prctl call to
enable KSM in the container's startup program (see the small example appended
below). Processes within the container then inherit the KSM attribute via
fork, allowing the entire container to have KSM enabled.

However, in practice, not all containers benefit from KSM's memory savings.
Some containers may have few identical pages but incur additional memory
overhead due to excessive ksm_rmap_item generation from KSM scanning.
Therefore, we need to provide a container-level KSM monitoring method,
enabling users to adjust their strategies based on actual KSM merging
performance.

> --
> Michal Hocko
> SUSE Labs
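
For context, the prctl-based enablement mentioned above is roughly the
following sketch. PR_SET_MEMORY_MERGE is the documented prctl option; the
fallback define and the placement in a container entrypoint are illustrative,
and the call needs CONFIG_KSM in the kernel plus CAP_SYS_RESOURCE.

  #include <stdio.h>
  #include <sys/prctl.h>

  #ifndef PR_SET_MEMORY_MERGE
  #define PR_SET_MEMORY_MERGE 67
  #endif

  int main(void)
  {
  	/*
  	 * Opt the whole process into KSM; children forked afterwards
  	 * (i.e. everything the container entrypoint starts) inherit
  	 * the attribute, as described above.
  	 */
  	if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0)) {
  		perror("PR_SET_MEMORY_MERGE");
  		return 1;
  	}

  	/* ... exec the real container workload here ... */
  	return 0;
  }
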
On Mon 22-09-25 17:31:58, xu.xin16@zte.com.cn wrote:
> > > From: xu xin <xu.xin16@zte.com.cn>
> > >
> > > v2->v3:
> > > ------
> > > Some fixes for compilation errors due to a missed header inclusion or a
> > > missed function definition on some kernel configs.
> > > https://lore.kernel.org/all/202509142147.WQI0impC-lkp@intel.com/
> > > https://lore.kernel.org/all/202509142046.QatEaTQV-lkp@intel.com/
> > >
> > > v1->v2:
> > > ------
> > > According to Shakeel's suggestion, expose these metric items in the
> > > existing memory.stat instead of adding a new interface.
> > > https://lore.kernel.org/all/ir2s6sqi6hrbz7ghmfngbif6fbgmswhqdljlntesurfl2xvmmv@yp3w2lqyipb5/
> > >
> > > Background
> > > ==========
> > >
> > > With the enablement of container-level KSM (e.g., via prctl [1]), there
> > > is a growing demand for container-level observability of KSM behavior.
> > > However, current cgroup implementations lack support for exposing
> > > KSM-related metrics.
> >
> > Could you be more specific about why this is needed and what it will be
> > used for?
>
> Yes. Some Linux application developers and vendors are eager to deploy the
> container-level KSM feature in containers (docker, containerd, runc and so
> on). They have found significant memory savings without needing to modify
> application source code as before, for example by adding a prctl call to
> enable KSM in the container's startup program. Processes within the
> container then inherit the KSM attribute via fork, allowing the entire
> container to have KSM enabled.
>
> However, in practice, not all containers benefit from KSM's memory savings.
> Some containers may have few identical pages but incur additional memory
> overhead due to excessive ksm_rmap_item generation from KSM scanning.
> Therefore, we need to provide a container-level KSM monitoring method,
> enabling users to adjust their strategies based on actual KSM merging
> performance.

So what is the strategy here? You watch the runtime behavior and then disable
KSM based on the previous run? I do not think this can be changed during
runtime, right? So it would only work for the next run, and that would rely
on the workload being consistent across re-runs, right?

I am not really convinced, TBH, but not so much as to NAK this. What concerns
me a bit is that these per-memcg stats are slightly different from the global
ones without a very good explanation (or maybe I have just not understood it
properly). Also, the use case sounds a bit shaky, as it doesn't really give
admins great control beyond a hope that a new execution of the container will
behave consistently with previous runs. I thought the whole concept of
per-process KSM is based on "we know our userspace benefits" rather than
"let's try and see".

All in all, I worry this will end up not really being used and we will have
yet another set of counters to maintain without real users.
--
Michal Hocko
SUSE Labs