[v4] mm: workingset reporting

[PATCH v4 0/9] mm: workingset reporting

Posted by Yuanchu Xie 1 year, 2 months ago

This patch series provides workingset reporting of user pages in
lruvecs, of which coldness can be tracked by accessed bits and fd
references. However, the concept of workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or the userspace.
Another interesting idea might be hugepage workingset, so that we can
measure the proportion of hugepages backing cold memory. However, with
architectures like arm, there may be too many hugepage sizes leading to
a combinatorial explosion when exporting stats to the userspace.
Nonetheless, the kernel should provide a set of workingset interfaces
that is generic enough to accommodate the various use cases, and extensible
to potential future use cases.

Use cases
==========
Job scheduling
On overcommitted hosts, workingset information improves efficiency and
reliability by allowing the job scheduler to have better stats on the
exact memory requirements of each job. This can manifest in efficiency by
landing more jobs on the same host or NUMA node. On the other hand, the
job scheduler can also ensure each node has a sufficient amount of memory
and does not enter direct reclaim or the kernel OOM path. With workingset
information and job priority, the userspace OOM killing or proactive
reclaim policy can kick in before the system is under memory pressure.
If the job shape is very different from the machine shape, knowing the
workingset per-node can also help inform page allocation policies.

Proactive reclaim
Workingset information allows the a container manager to proactively
reclaim memory while not impacting a job's performance. While PSI may
provide a reactive measure of when a proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.

Ballooning (similar to proactive reclaim)
The last patch of the series extends the virtio-balloon device to report
the guest workingset.
Balloon policies benefit from workingset to more precisely determine the
size of the memory balloon. On end-user devices where memory is scarce and
overcommitted, the balloon sizing in multiple VMs running on the same
device can be orchestrated with workingset reports from each one.
On the server side, workingset reporting allows the balloon controller to
inflate the balloon without causing too much file cache to be reclaimed in
the guest.

Promotion/Demotion
If different mechanisms are used for promition and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, given a promotion hot page threshold defined in reaccess
distance of N seconds (promote pages accessed more often than every N
seconds). The threshold N should be set so that ~80% (e.g.) of pages on
the fast memory node passes the threshold. This calculation can be done
with workingset reports.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].

[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea here is we break down the workingset per-node per-memcg into time
intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892

Implementation
==========
The reporting of user pages is based off of MGLRU, and therefore requires
CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
fine-grained workingset report, but we can already gather a lot of data
with just four generations. The workingset reporting mechanism is gated
behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
CONFIG_WORKINGSET_REPORT_AGING.

Benchmarks
==========
Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
compile and redis benchmarks from openbenchmarking.org. The policy and
runner is referred to as WMO (Workload Memory Optimization).
The results were based on v3 of the series, but v4 doesn't change the core
of the working set reporting and just adds the ballooning counterpart.

The timed Linux kernel compilation benchmark shows improvements in peak
memory usage with a policy of "swap out all bytes colder than 10 seconds
every 40 seconds". A swapfile is configured on SSD.
--------------------------------------------
peak memory usage (with WMO): 4982.61328 MiB
peak memory usage (control): 9569.1367 MiB
peak memory reduction: 47.9%
--------------------------------------------
Benchmark                                           | Experimental     |Control         | Experimental_Std_Dev | Control_Std_Dev
Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
--------------------------------------------
Seconds, fewer is better

The redis benchmark shows employs the same policy:
--------------------------------------------
peak memory usage (with WMO): 375.9023 MiB
peak memory usage (control): 509.765 MiB
peak memory reduction: 26%
--------------------------------------------
Benchmark               | Experimental     | Control          | Experimental_Std_Dev | Control_Std_Dev
Redis - LPOP (Reqs/sec) | 2023130 (98.22%) | 2059849 (100%)   | 1.2%                 | 2%
Redis - SADD (Reqs/sec) | 2539662 (98.63%) | 2574811 (100%)   | 2.3%                 | 1.4%
Redis - LPUSH (Reqs/sec)| 2024880 (100%)   | 2000884 (98.81%) | 1.1%                 | 0.8%
Redis - GET (Reqs/sec)  | 2835764 (100%)   | 2763722 (97.46%) | 2.7%                 | 1.6%
Redis - SET (Reqs/sec)  | 2340723 (100%)   | 2327372 (99.43%) | 2.4%                 | 1.8%
--------------------------------------------
Reqs/sec, more is better

The detailed report and benchmarking results are in Ghait's repo:
https://github.com/miloudi98/WMO

Changelog
==========

Changes from PATCH v3 -> v4:
- Added documentation for cgroup-v2
  (Waiman Long)
- Fixed types in documentation
  (Randy Dunlap)
- Added implementation for the ballooning use case
- Added detailed description of benchmark results
  (Andrew Morton)

Changes from PATCH v2 -> v3:
- Fixed typos in commit messages and documentation
  (Lance Yang, Randy Dunlap)
- Split out the force_scan patch to be reviewed separately
- Added benchmarks from Ghait Ouled Amar Ben Cheikh
- Fixed reported compile error without CONFIG_MEMCG

Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case
  (Muhammad Usama Anjum)
- Included more use cases in the cover letter
  (Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data
  to avoid allocating for each lruvec.

[v1] https://lore.kernel.org/linux-mm/20240504073011.4000534-1-yuanchu@google.com/
[v2] https://lore.kernel.org/linux-mm/20240604020549.1017540-1-yuanchu@google.com/
[v3] https://lore.kernel.org/linux-mm/20240813165619.748102-1-yuanchu@google.com/

Yuanchu Xie (9):
  mm: aggregate workingset information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend workingset reporting to memcgs
  mm: add kernel aging thread for workingset reporting
  selftest: test system-wide workingset reporting
  Docs/admin-guide/mm/workingset_report: document sysfs and memcg
    interfaces
  Docs/admin-guide/cgroup-v2: document workingset reporting
  virtio-balloon: add workingset reporting

 Documentation/admin-guide/cgroup-v2.rst       |  35 +
 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/workingset_report.rst      | 105 +++
 drivers/base/node.c                           |   6 +
 drivers/virtio/virtio_balloon.c               | 390 ++++++++++-
 include/linux/balloon_compaction.h            |   1 +
 include/linux/memcontrol.h                    |  21 +
 include/linux/mmzone.h                        |  13 +
 include/linux/workingset_report.h             | 167 +++++
 include/uapi/linux/virtio_balloon.h           |  30 +
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  19 +
 mm/memcontrol.c                               | 162 ++++-
 mm/mm_init.c                                  |   2 +
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  56 +-
 mm/workingset_report.c                        | 653 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 ++++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 .../testing/selftests/mm/workingset_report.c  | 306 ++++++++
 .../testing/selftests/mm/workingset_report.h  |  39 ++
 .../selftests/mm/workingset_report_test.c     | 330 +++++++++
 25 files changed, 2482 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

-- 
2.47.0.338.g60cca15819-goog

Re: [PATCH v4 0/9] mm: workingset reporting

Posted by Johannes Weiner 1 year, 2 months ago

On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> This patch series provides workingset reporting of user pages in
> lruvecs, of which coldness can be tracked by accessed bits and fd
> references. However, the concept of workingset applies generically to
> all types of memory, which could be kernel slab caches, discardable
> userspace caches (databases), or CXL.mem. Therefore, data sources might
> come from slab shrinkers, device drivers, or the userspace.
> Another interesting idea might be hugepage workingset, so that we can
> measure the proportion of hugepages backing cold memory. However, with
> architectures like arm, there may be too many hugepage sizes leading to
> a combinatorial explosion when exporting stats to the userspace.
> Nonetheless, the kernel should provide a set of workingset interfaces
> that is generic enough to accommodate the various use cases, and extensible
> to potential future use cases.

Doesn't DAMON already provide this information?

CCing SJ.

> Use cases
> ==========
> Job scheduling
> On overcommitted hosts, workingset information improves efficiency and
> reliability by allowing the job scheduler to have better stats on the
> exact memory requirements of each job. This can manifest in efficiency by
> landing more jobs on the same host or NUMA node. On the other hand, the
> job scheduler can also ensure each node has a sufficient amount of memory
> and does not enter direct reclaim or the kernel OOM path. With workingset
> information and job priority, the userspace OOM killing or proactive
> reclaim policy can kick in before the system is under memory pressure.
> If the job shape is very different from the machine shape, knowing the
> workingset per-node can also help inform page allocation policies.
> 
> Proactive reclaim
> Workingset information allows the a container manager to proactively
> reclaim memory while not impacting a job's performance. While PSI may
> provide a reactive measure of when a proactive reclaim has reclaimed too
> much, workingset reporting allows the policy to be more accurate and
> flexible.

I'm not sure about more accurate.

Access frequency is only half the picture. Whether you need to keep
memory with a given frequency resident depends on the speed of the
backing device.

There is memory compression; there is swap on flash; swap on crappy
flash; swapfiles that share IOPS with co-located filesystems. There is
zswap+writeback, where avg refault speed can vary dramatically.

You can of course offload much more to a fast zswap backend than to a
swapfile on a struggling flashdrive, with comparable app performance.

So I think you'd be hard pressed to achieve a high level of accuracy
in the usecases you list without taking the (often highly dynamic)
cost of paging / memory transfer into account.

There is a more detailed discussion of this in a paper we wrote on
proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:

https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf

> Ballooning (similar to proactive reclaim)
> The last patch of the series extends the virtio-balloon device to report
> the guest workingset.
> Balloon policies benefit from workingset to more precisely determine the
> size of the memory balloon. On end-user devices where memory is scarce and
> overcommitted, the balloon sizing in multiple VMs running on the same
> device can be orchestrated with workingset reports from each one.
> On the server side, workingset reporting allows the balloon controller to
> inflate the balloon without causing too much file cache to be reclaimed in
> the guest.
> 
> Promotion/Demotion
> If different mechanisms are used for promition and demotion, workingset
> information can help connect the two and avoid pages being migrated back
> and forth.
> For example, given a promotion hot page threshold defined in reaccess
> distance of N seconds (promote pages accessed more often than every N
> seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> the fast memory node passes the threshold. This calculation can be done
> with workingset reports.
> To be directly useful for promotion policies, the workingset report
> interfaces need to be extended to report hotness and gather hotness
> information from the devices[1].
> 
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
>
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The main
> idea here is we break down the workingset per-node per-memcg into time
> intervals (ms), e.g.
> 
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
>
> Implementation
> ==========
> The reporting of user pages is based off of MGLRU, and therefore requires
> CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
> fine-grained workingset report, but we can already gather a lot of data
> with just four generations. The workingset reporting mechanism is gated
> behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
> CONFIG_WORKINGSET_REPORT_AGING.
> 
> Benchmarks
> ==========
> Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> compile and redis benchmarks from openbenchmarking.org. The policy and
> runner is referred to as WMO (Workload Memory Optimization).
> The results were based on v3 of the series, but v4 doesn't change the core
> of the working set reporting and just adds the ballooning counterpart.
> 
> The timed Linux kernel compilation benchmark shows improvements in peak
> memory usage with a policy of "swap out all bytes colder than 10 seconds
> every 40 seconds". A swapfile is configured on SSD.
> --------------------------------------------
> peak memory usage (with WMO): 4982.61328 MiB
> peak memory usage (control): 9569.1367 MiB
> peak memory reduction: 47.9%
> --------------------------------------------
> Benchmark                                           | Experimental     |Control         | Experimental_Std_Dev | Control_Std_Dev
> Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
> --------------------------------------------
> Seconds, fewer is better

You can do this with a recent (>2018) upstream kernel and ~100 lines
of python [1]. It also works on both LRU implementations.

[1] https://github.com/facebookincubator/senpai

We use this approach in virtually the entire Meta fleet, to offload
unneeded memory, estimate available capacity for job scheduling, plan
future capacity needs, and provide accurate memory usage feedback to
application developers.

It works over a wide variety of CPU and storage configurations with no
specific tuning.

The paper I referenced above provides a detailed breakdown of how it
all works together.

I would be curious to see a more in-depth comparison to the prior art
in this space. At first glance, your proposal seems more complex and
less robust/versatile, at least for offloading and capacity gauging.

It does provide more detailed insight into userspace memory behavior,
which could be helpful when trying to make sense of applications that
sit on a rich layer of libraries and complicated runtimes. But here a
comparison to DAMON would be helpful.

>  25 files changed, 2482 insertions(+), 9 deletions(-)
>  create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
>  create mode 100644 include/linux/workingset_report.h
>  create mode 100644 mm/workingset_report.c
>  create mode 100644 mm/workingset_report_aging.c
>  create mode 100644 tools/testing/selftests/mm/workingset_report.c
>  create mode 100644 tools/testing/selftests/mm/workingset_report.h
>  create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

Re: [PATCH v4 0/9] mm: workingset reporting

Posted by Yuanchu Xie 1 year, 2 months ago

Thanks for the response Johannes. Some replies inline.

On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?
>
> CCing SJ.
Thanks for the CC. DAMON was really good at visualizing the memory
access frequencies last time I tried it out! For server use cases,
DAMON would benefit from integrations with cgroups. The key then would
be a standard interface for exporting a cgroup's working set to the
user. It would be good to have something that will work for different
backing implementations, DAMON, MGLRU, or active/inactive LRU.

>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can manifest in efficiency by
> > landing more jobs on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows the a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.
>
> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.
>
> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>
Yes, PSI takes into account the paging cost. I'm not claiming that
Workingset reporting provides a superset of information, but rather it
can complement PSI. Sorry for the bad wording here.

> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.
The ballooning use case is an important one. Having working set
information would allow us to inflate a balloon of the right size in
the guest.

> >
> > Promotion/Demotion
> > If different mechanisms are used for promition and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, given a promotion hot page threshold defined in reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node passes the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >...
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner is referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark                                           | Experimental     |Control         | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.
>
> The paper I referenced above provides a detailed breakdown of how it
> all works together.
>
> I would be curious to see a more in-depth comparison to the prior art
> in this space. At first glance, your proposal seems more complex and
> less robust/versatile, at least for offloading and capacity gauging.
We have implemented TMO PSI-based proactive reclaim and compared it to
a kstaled-based reclaimer (reclaiming based on 2 minute working set
and refaults). The PSI-based reclaimer was able to save more memory,
but it also caused spikes of refaults and a lot higher
decompressions/second. Overall the test workloads had better
performance with the kstaled-based reclaimer. The conclusion was that
it was a trade-off. Since we have some app classes that we don't want
to induce pressure but still want to proactively reclaim from, there's
a missing piece. I do agree there's not a good in-depth comparison
with prior art though.

Re: [PATCH v4 0/9] mm: workingset reporting

Posted by SeongJae Park 1 year, 1 month ago

On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie <yuanchu@google.com> wrote:

> Thanks for the response Johannes. Some replies inline.
> 
> On Tue, Nov 26, 2024 at 11:26\u202fPM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > > This patch series provides workingset reporting of user pages in
> > > lruvecs, of which coldness can be tracked by accessed bits and fd
> > > references. However, the concept of workingset applies generically to
> > > all types of memory, which could be kernel slab caches, discardable
> > > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > > come from slab shrinkers, device drivers, or the userspace.
> > > Another interesting idea might be hugepage workingset, so that we can
> > > measure the proportion of hugepages backing cold memory. However, with
> > > architectures like arm, there may be too many hugepage sizes leading to
> > > a combinatorial explosion when exporting stats to the userspace.
> > > Nonetheless, the kernel should provide a set of workingset interfaces
> > > that is generic enough to accommodate the various use cases, and extensible
> > > to potential future use cases.
> >
> > Doesn't DAMON already provide this information?
> >
> > CCing SJ.
> Thanks for the CC. DAMON was really good at visualizing the memory
> access frequencies last time I tried it out!

Thank you for this kind acknowledgement, Yuanchu!

> For server use cases,
> DAMON would benefit from integrations with cgroups.  The key then would be a
> standard interface for exporting a cgroup's working set to the user.

I show two ways to make DAMON supports cgroups for now.  First way is making
another DAMON operations set implementation for cgroups.  I shared a rough idea
for this before, probably on kernel summit.  But I haven't had a chance to
prioritize this so far.  Please let me know if you need more details.  The
second way is extending DAMOS filter to provide more detailed statistics per
DAMON-region, and adding another DAMOS action that does nothing but only
accounting the detailed statistics.  Using the new DAMOS action, users will be
able to know how much of specific DAMON-found regions are filtered out by the
given filter.  Because we have DAMOS filter type for cgroups, we can know how
much of workingset (or, warm memory) belongs to specific groups.  This can be
applied to not only cgroups, but for any DAMOS filter types that exist (e.g.,
anonymous page, young page).

I believe the second way is simpler to implement while providing information
that sufficient for most possible use cases.  I was anyway planning to do this.

> It would be good to have something that will work for different
> backing implementations, DAMON, MGLRU, or active/inactive LRU.

I think we can do this using the filter statistics, with new filter types.  For
example, we can add new DAMOS filter that filters pages if it is for specific
range of MGLRU-gen of the page, or whether the page belongs to active or
inactive LRU lists.

> 
> >
> > > Use cases
> > > ==========
[...]
> > Access frequency is only half the picture. Whether you need to keep
> > memory with a given frequency resident depends on the speed of the
> > backing device.
[...]
> > > Benchmarks
> > > ==========
> > > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > > compile and redis benchmarks from openbenchmarking.org. The policy and
> > > runner is referred to as WMO (Workload Memory Optimization).
> > > The results were based on v3 of the series, but v4 doesn't change the core
> > > of the working set reporting and just adds the ballooning counterpart.
> > >
> > > The timed Linux kernel compilation benchmark shows improvements in peak
> > > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > > every 40 seconds". A swapfile is configured on SSD.
[...]
> > You can do this with a recent (>2018) upstream kernel and ~100 lines
> > of python [1]. It also works on both LRU implementations.
> >
> > [1] https://github.com/facebookincubator/senpai
> >
> > We use this approach in virtually the entire Meta fleet, to offload
> > unneeded memory, estimate available capacity for job scheduling, plan
> > future capacity needs, and provide accurate memory usage feedback to
> > application developers.
> >
> > It works over a wide variety of CPU and storage configurations with no
> > specific tuning.
> >
> > The paper I referenced above provides a detailed breakdown of how it
> > all works together.
> >
> > I would be curious to see a more in-depth comparison to the prior art
> > in this space. At first glance, your proposal seems more complex and
> > less robust/versatile, at least for offloading and capacity gauging.
> We have implemented TMO PSI-based proactive reclaim and compared it to
> a kstaled-based reclaimer (reclaiming based on 2 minute working set
> and refaults). The PSI-based reclaimer was able to save more memory,
> but it also caused spikes of refaults and a lot higher
> decompressions/second. Overall the test workloads had better
> performance with the kstaled-based reclaimer. The conclusion was that
> it was a trade-off.

I agree it is only half of the picture, and there could be tradeoff.  Motivated
by those previous works, DAMOS provides PSI-based aggressiveness auto-tuning to
use both ways.

> I do agree there's not a good in-depth comparison
> with prior art though.

I would be more than happy to help the comparison work agains DAMON of current
implementation and future plans, and any possible collaborations.


Thanks,
SJ

Re: [PATCH v4 0/9] mm: workingset reporting

Posted by Yuanchu Xie 1 year ago

On Wed, Dec 11, 2024 at 11:53 AM SeongJae Park <sj@kernel.org> wrote:
>
> On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie <yuanchu@google.com> wrote:
>
> > Thanks for the response Johannes. Some replies inline.
> >
> > On Tue, Nov 26, 2024 at 11:26\u202fPM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > > > This patch series provides workingset reporting of user pages in
> > > > lruvecs, of which coldness can be tracked by accessed bits and fd
> > > > references. However, the concept of workingset applies generically to
> > > > all types of memory, which could be kernel slab caches, discardable
> > > > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > > > come from slab shrinkers, device drivers, or the userspace.
> > > > Another interesting idea might be hugepage workingset, so that we can
> > > > measure the proportion of hugepages backing cold memory. However, with
> > > > architectures like arm, there may be too many hugepage sizes leading to
> > > > a combinatorial explosion when exporting stats to the userspace.
> > > > Nonetheless, the kernel should provide a set of workingset interfaces
> > > > that is generic enough to accommodate the various use cases, and extensible
> > > > to potential future use cases.
> > >
> > > Doesn't DAMON already provide this information?
> > >
> > > CCing SJ.
> > Thanks for the CC. DAMON was really good at visualizing the memory
> > access frequencies last time I tried it out!
>
> Thank you for this kind acknowledgement, Yuanchu!
>
> > For server use cases,
> > DAMON would benefit from integrations with cgroups.  The key then would be a
> > standard interface for exporting a cgroup's working set to the user.
>
> I show two ways to make DAMON supports cgroups for now.  First way is making
> another DAMON operations set implementation for cgroups.  I shared a rough idea
> for this before, probably on kernel summit.  But I haven't had a chance to
> prioritize this so far.  Please let me know if you need more details.  The
> second way is extending DAMOS filter to provide more detailed statistics per
> DAMON-region, and adding another DAMOS action that does nothing but only
> accounting the detailed statistics.  Using the new DAMOS action, users will be
> able to know how much of specific DAMON-found regions are filtered out by the
> given filter.  Because we have DAMOS filter type for cgroups, we can know how
> much of workingset (or, warm memory) belongs to specific groups.  This can be
> applied to not only cgroups, but for any DAMOS filter types that exist (e.g.,
> anonymous page, young page).
>
> I believe the second way is simpler to implement while providing information
> that sufficient for most possible use cases.  I was anyway planning to do this.
For a container orchestrator like kubernetes, the node agents need to
be able to gather the working set stats at a per-job level. Some jobs
can create sub-hierarchies as well, so it's important that we have
hierarchical stats.

Do you think it's a good idea to integrate DAMON to provide some
aggregate stats in a memory controller file? With the DAMOS cgroup
filter, there can be some kind of interface that a DAMOS action or the
damo tool could call into. I feel that would be a straightforward and
integrated way to support cgroups.

Yuanchu

Re: [PATCH v4 0/9] mm: workingset reporting

Posted by SeongJae Park 1 year ago

Hi Yuanchu,

On Wed, 29 Jan 2025 18:02:26 -0800 Yuanchu Xie <yuanchu@google.com> wrote:

> On Wed, Dec 11, 2024 at 11:53 AM SeongJae Park <sj@kernel.org> wrote:
> >
> > On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie <yuanchu@google.com> wrote:
> >
> > > Thanks for the response Johannes. Some replies inline.
> > >
> > > On Tue, Nov 26, 2024 at 11:26\u202fPM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > > > > This patch series provides workingset reporting of user pages in
> > > > > lruvecs, of which coldness can be tracked by accessed bits and fd
> > > > > references. However, the concept of workingset applies generically to
> > > > > all types of memory, which could be kernel slab caches, discardable
> > > > > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > > > > come from slab shrinkers, device drivers, or the userspace.
> > > > > Another interesting idea might be hugepage workingset, so that we can
> > > > > measure the proportion of hugepages backing cold memory. However, with
> > > > > architectures like arm, there may be too many hugepage sizes leading to
> > > > > a combinatorial explosion when exporting stats to the userspace.
> > > > > Nonetheless, the kernel should provide a set of workingset interfaces
> > > > > that is generic enough to accommodate the various use cases, and extensible
> > > > > to potential future use cases.
> > > >
> > > > Doesn't DAMON already provide this information?
> > > >
> > > > CCing SJ.
> > > Thanks for the CC. DAMON was really good at visualizing the memory
> > > access frequencies last time I tried it out!
> >
> > Thank you for this kind acknowledgement, Yuanchu!
> >
> > > For server use cases,
> > > DAMON would benefit from integrations with cgroups.  The key then would be a
> > > standard interface for exporting a cgroup's working set to the user.
> >
> > I show two ways to make DAMON supports cgroups for now.  First way is making
> > another DAMON operations set implementation for cgroups.  I shared a rough idea
> > for this before, probably on kernel summit.  But I haven't had a chance to
> > prioritize this so far.  Please let me know if you need more details.  The
> > second way is extending DAMOS filter to provide more detailed statistics per
> > DAMON-region, and adding another DAMOS action that does nothing but only
> > accounting the detailed statistics.  Using the new DAMOS action, users will be
> > able to know how much of specific DAMON-found regions are filtered out by the
> > given filter.  Because we have DAMOS filter type for cgroups, we can know how
> > much of workingset (or, warm memory) belongs to specific groups.  This can be
> > applied to not only cgroups, but for any DAMOS filter types that exist (e.g.,
> > anonymous page, young page).
> >
> > I believe the second way is simpler to implement while providing information
> > that sufficient for most possible use cases.  I was anyway planning to do this.

I implemented the feature for the second approach I mentioned above.  The
initial version of the feature has recently merged[1] into the mainline as a
part of 6.14-rc1 MM pull request.  DAMON user-space tool (damo) is also updated
for baisc support of it.  I forgot updating that on this thread, sorry.

> For a container orchestrator like kubernetes, the node agents need to
> be able to gather the working set stats at a per-job level. Some jobs
> can create sub-hierarchies as well, so it's important that we have
> hierarchical stats.

This makes sense to me.  And yes, I believe DAMOS filters for memcg could also
be used for this use case, since we can install and use multiple DAMOS filters
in combinations.

The documentation of the feature is not that good and there are many rooms to
improve.  You might not be able to get what you want in a perfect way with the
current implementation.  But we will continue improving it, and I believe we
can make it faster if efforts are gathered.  Of course, I could be wrong, and
whether to use it or not is up to each person :)

Anyway, please feel free to ask me questions or any help about the feature if
you want.

> 
> Do you think it's a good idea to integrate DAMON to provide some
> aggregate stats in a memory controller file? With the DAMOS cgroup
> filter, there can be some kind of interface that a DAMOS action or the
> damo tool could call into. I feel that would be a straightforward and
> integrated way to support cgroups.

DAMON basically exposes its internal information via DAMON sysfs, and DAMON
user-space tool (damo) uses it.  In this case, per-memcg working set could also
be retrieved in the way (directly from DAMON sysfs or indirectly from damo).

But, yes, I think we could make new and optimized ABIs for exposing the
information to user-space in more efficient way depending on the use case, if
needed.  DAMON modules such as DAMON_RECLAIM and DAMON_LRU_SORT provides their
own ABIs that simplified and optimized for their usages.

[1] https://git.kernel.org/torvalds/c/626ffabe67c2


Thanks,
SJ

[...]

Re: [PATCH v4 0/9] mm: workingset reporting

Posted by Yu Zhao 1 year, 2 months ago

On Wed, Nov 27, 2024 at 12:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?

Yuanchu might be able to answer this question a lot better than I do,
since he studied DAMON and tried to leverage it in our fleet.

My impression is that there are some fundamental differences in access
detection and accounting mechanisms between the two, i.e., sampling vs
scanning-based detection and non-lruvec vs lruvec-based accounting.

> CCing SJ.
>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can manifest in efficiency by
> > landing more jobs on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows the a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.

Agreed. This is a (very) poor argument, unless there are facts to back this up.

> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.

Along a similar line, we also need to consider use cases that don't
involve backing storage, e.g., far memory (remote node). More details below.

> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>
> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.
> >
> > Promotion/Demotion
> > If different mechanisms are used for promition and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, given a promotion hot page threshold defined in reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node passes the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > Implementation
> > ==========
> > The reporting of user pages is based off of MGLRU, and therefore requires
> > CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
> > fine-grained workingset report, but we can already gather a lot of data
> > with just four generations. The workingset reporting mechanism is gated
> > behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
> > CONFIG_WORKINGSET_REPORT_AGING.
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner is referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark                                           | Experimental     |Control         | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.

How would Senpai work for use cases that don't have local storage,
i.e., all memory is mapped by either the fast or the slow tier? (>95%
memory usage in our fleet is mapped and local storage for non-storage
servers is only scratch space.)

My current understanding is that its approach would not be able to
form a feedback loop because there are currently no refaults from the
slow tier (because it's also mapped), and that's where I think this
proposal or something similar can help.

Also this proposal reports histograms, not scalars. So in theory,
userspace can see the projections of its potential actions, rather
than solely rely on trial and error. Of course, this needs to be
backed with data. So yes, some comparisons from real-world use cases
would be very helpful to demonstrate the value of this proposal.

Re: [PATCH v4 0/9] mm: workingset reporting

Posted by SeongJae Park 1 year, 2 months ago

+ damon@lists.linux.dev

I haven't thoroughly read any version of this patch series due to my laziness,
sorry.  So I may saying something completely wrong.  My apology in advance, and
please correct me in the case.

> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references.

DAMON provides data access patterns of user pages.  It is not exactly named as
workingset but a superset of the information.  Users can therefore get the
workingset from DAMON-provided raw data.  So I feel I have to ask if DAMON can
be used for, or help at achieving the purpose of this patch series.

Depending on the detailed definition of workingset, of course, the workingset
we can get from DAMON might not be technically same to what this patch series
aim to provide, and the difference could be somewhat that makes DAMON unable to
be used or help here.  But I cannot know if this is the case with only this
cover letter.

> > However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.

This again sounds similar to what DAMON aims to provide, to me.  DAMON is
designed to be easy to extend for vairous use cases and internal mechanisms.
Specifically, it separates access check mechanisms and core logic into
different layers, and provides an interface to use for implementing extending
DAMON with new mechanisms.  DAMON's two access check mechanisms for virtual
address spaces and the physical address space are made using the interface,
indeed.  Also there were RFC patch series extending DAMON for NUMA-specific and
write-only access monitoring using NUMA hinting fault and soft-dirty PTEs as
the internal mechanisms.

My humble understanding of the major difference between DAMON and workingset
reporting is the internal mechanism.  Workingset reporting uses MGLRU as the
access check mechanism, while current access check mechanisms for DAMON are
using page table accessed bits checking as the major mechanism.  I think DAMON
can be extended to use MGLRU as its another internal access check mechanism,
but I understand that there could be many things that I overseeing.

Yuanchu, I think it would help me and other reviewers better understand this
patch series if you could share that.  And I will also be more than happy to
help you and others better understanding what DAMON can do or not with the
discussion.

> 
> Doesn't DAMON already provide this information?
> 
> CCing SJ.

Thank you for adding me, Johannes :)

[...]
> It does provide more detailed insight into userspace memory behavior,
> which could be helpful when trying to make sense of applications that
> sit on a rich layer of libraries and complicated runtimes. But here a
> comparison to DAMON would be helpful.

100% agree.

Thanks,
SJ

[...]