Today, the MGLRU debugfs interface (/sys/kernel/debug/lru_gen) provides
a histogram of the number of pages belonging to each generation, which
gives some indication of memory coldness, but not of where that memory
actually is. However, since MGLRU revamps the page reclaim mechanism to
walk page tables, we can hook into MGLRU's page table access bit
harvesting with a BPF program to collect information on relative hotness
and coldness, NUMA nodes, whether a page is anon or file-backed, etc.
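As a rough illustration of what such a hook could look like, here is a
hedged BPF-side sketch. The hook name and its argument layout are
hypothetical (the actual hook is what patch 1 adds, and may differ);
the sketch assumes vmlinux.h and libbpf CO-RE, and simply counts how
often each page is found young during aging passes:

```c
/* Sketch only: "mglru_pte_young" and its arguments are hypothetical. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1 << 20);
	__type(key, u64);	/* page-aligned virtual address */
	__type(value, u64);	/* times found young across aging passes */
} page_hits SEC(".maps");

/* Hypothetical hook fired for each young PTE found during MGLRU aging. */
SEC("tp_btf/mglru_pte_young")
int BPF_PROG(on_pte_young, u64 addr, int nid, bool is_anon)
{
	u64 key = addr & ~((u64)4095);
	u64 one = 1, *cnt;

	cnt = bpf_map_lookup_elem(&page_hits, &key);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	else
		bpf_map_update_elem(&page_hits, &key, &one, BPF_NOEXIST);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A userspace agent would then drain the map after each aging pass to
build whatever aggregate it wants.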
Using BPF programs to collect and aggregate page access information
lets a userspace agent customize what to collect and how to aggregate
it. It could focus on a particular region of interest and compute a
moving-average access frequency, or find allocations that are never
accessed and could be eliminated altogether. Currently MGLRU relies on
heuristics to decide which generation a page is assigned to; for
example, pages accessed through page tables are always assigned to the
youngest generation. Exposing page access data can enable future work
to customize generation assignment (with more BPF).
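The moving-average aggregation mentioned above can be sketched in
userspace. This is a hedged example, not code from the patches: the
(addr, accessed) sample format, the 1 MiB region size, and the
smoothing factor are all assumptions, standing in for whatever a real
agent would read out of the BPF map after each aging pass:

```python
# Hedged sketch: fold per-page access-bit samples from successive MGLRU
# aging passes into an exponentially weighted moving-average "heat" per
# region. The sample format (addr, accessed) is hypothetical.
from collections import defaultdict

PAGE_SIZE = 4096
REGION_PAGES = 256          # aggregate over 1 MiB regions
ALPHA = 0.25                # EWMA smoothing factor

def region_of(addr):
    """Map a virtual address to its 1 MiB region index."""
    return addr // (PAGE_SIZE * REGION_PAGES)

def update_heat(heat, samples):
    """Fold one aging pass into per-region heat.

    `samples` is an iterable of (addr, accessed) pairs, where `accessed`
    is 1 if the page's access bit was set since the previous pass.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for addr, accessed in samples:
        r = region_of(addr)
        hits[r] += accessed
        total[r] += 1
    for r in total:
        freq = hits[r] / total[r]
        heat[r] = (1 - ALPHA) * heat.get(r, 0.0) + ALPHA * freq
    return heat
```

Regions whose heat never rises above zero across many passes would be
candidates for the never-accessed allocations mentioned above.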
We demonstrate feasibility with a proof-of-concept that prints a live
heatmap of a process, with configurable MGLRU aging intervals and
aggregation intervals. This is a very rough PoC that still needs a lot
of work, but it shows how much can be done by exposing page access
information from MGLRU. I will be presenting this work at the upcoming
LPC.
As an example, I ran the memtier benchmark[1] and captured a heatmap of
memcached being populated and then exercised by the benchmark (similar
to the one Yu posted for OpenWRT[2]):
$ cat ./run_memtier_benchmark.sh
run_memtier_benchmark()
{
	# populate dataset
	memtier_benchmark/memtier_benchmark -s 127.0.0.1 -p 11211 \
		-P memcache_binary -n allkeys -t 1 -c 1 --ratio 1:0 --pipeline 8 \
		--key-minimum=1 --key-maximum=$2 --key-pattern=P:P \
		-d 1000
	# access dataset using Gaussian pattern
	memtier_benchmark/memtier_benchmark -s 127.0.0.1 -p 11211 \
		-P memcache_binary --test-time $1 -t 1 -c 1 --ratio 0:1 \
		--pipeline 8 --key-minimum=1 --key-maximum=$2 \
		--key-pattern=G:G --randomize --distinct-client-seed
	# collect results
}

run_duration_secs=3600
max_key=8000000
run_memtier_benchmark $run_duration_secs $max_key
In the following screenshot we can see the dataset being populated and
then accessed:
https://services.google.com/fh/files/events/memcached_memtier_startup.png
Patch 1 adds the infrastructure to enable BPF programs to monitor page
access bit harvesting.
Patch 2 includes a proof-of-concept Python TUI program that displays
live per-process heatmaps.
[1] https://github.com/RedisLabs/memtier_benchmark
[2] https://lore.kernel.org/all/20220831041731.3836322-1-yuzhao@google.com/
Yuanchu Xie (2):
mm: multi-gen LRU: support page access info harvesting with eBPF
mm: add a BPF-based per-process heatmap tool
include/linux/mmzone.h | 1 +
mm/vmscan.c | 154 ++++++++
tools/vm/heatmap/Makefile | 30 ++
tools/vm/heatmap/heatmap.bpf.c | 123 +++++++
tools/vm/heatmap/heatmap.user.c | 188 ++++++++++
tools/vm/heatmap/heatmap_tui.py | 600 ++++++++++++++++++++++++++++++++
6 files changed, 1096 insertions(+)
create mode 100644 tools/vm/heatmap/Makefile
create mode 100644 tools/vm/heatmap/heatmap.bpf.c
create mode 100644 tools/vm/heatmap/heatmap.user.c
create mode 100755 tools/vm/heatmap/heatmap_tui.py
--
2.37.2.789.g6183377224-goog
Hi all,

It has been a long time since this patchset[0] was submitted, and I've
been doing something similar recently. I wonder why this patchset
remains unmerged/uncommented? Is there any other similar work?

Thanks!

[0] https://lore.kernel.org/all/20220911083418.2818369-1-yuanchu@google.com/
On Thu, Jan 2, 2025 at 7:15 PM Muyang Tian <tianmuyang@huawei.com> wrote:
>
> Hi all,
> It has been a long time since this patchset[0] submitted, and I've been doing something similar recently.
> I wonder why this patchset remains unmerged/uncommented? Is there any other similar work?

Hi Muyang,

I'd love to learn about your use case and your approach as well. The
code here requires some polish and cleanup, but it's mostly bpf code
in tooling. Happy to work on it.

Thanks,
Yuanchu
Hi Yuanchu,

I'm working on observability and a programmable page generation policy
for MGLRU based on eBPF, using a similar approach to yours. I'd like to
know if there is any related work, such as the application of eBPF in
MGLRU?

Also, this RFC provides a user space interface to call run_aging(),
which is called periodically in the demo. Do you plan to optimize this,
perhaps by calling run_aging() based on page access observation
results?

Thanks!
On Thu, Jan 9, 2025 at 3:48 AM Muyang Tian <tianmuyang@huawei.com> wrote:
>
> Hi Yuanchu,
>
> I'm working on observability and the programmable page generation policy of MGLRU based on eBPF, using a similar approach to yours.
> I'd like to know if there is any related work, such as the application of eBPF in MGLRU?

Not that I'm aware of. There were some patches fiddling with the
generation placement of pages, but I can't seem to find them.

> Also, this RFC provides a user space interface to call run_aging(), which is called periodically in the demo.
> Do you plan to optimize this, perhaps by calling run_aging() based on page access observation results?

Right now I don't have any plans to optimize this patch series. What
are your use cases? All I cared about was one-off observability of
accesses, and not much thought went into optimizing the tool.

Thanks,
Yuanchu