[PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent

Vernon Yang posted 4 patches 1 month, 1 week ago
There is a newer version of this series
[PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
Posted by Vernon Yang 1 month, 1 week ago
From: Vernon Yang <yanglincheng@kylinos.cn>

Hi all,

Background
==========

A system can run several different workloads at the same time, but THP
does not benefit all of them: it is best suited to memory-intensive
applications that are not sensitive to tail latency. Redis, for example,
is sensitive to tail latency and is a poor fit for THP. In practice, THP
is often turned off entirely because of Redis, which prevents other
workloads from benefiting from it.

There are also some embedded scenarios (e.g. Android) that directly use
2MB THP, where the granularity is too large. Therefore, we introduced
mTHP in v6.8, which supports multiple-size THP. In practice, however, we
still globally fix a single mTHP size and are unable to automatically
select different mTHP sizes based on different scenarios.

Testing showed the following:

- When the system has plenty of free memory, Redis works fine with
  mTHP. Performance degradation in Redis only occurs when the system
  is under high memory pressure.
- Additionally, when a large number of small-memory processes use mTHP,
  memory waste is prone to occur, and performance degradation may also
  happen during fast memory allocation/release.

Previously, "Cgroup-based THP control"[1] was proposed, but it had the
following issues.

- It breaks the cgroup hierarchy property.
- It adds new THP knobs, making the sysadmin's job more complex.

Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
following issues.

- It did not address the issue in per-process mode.
- For global mode, prctl(PR_SET_THP_DISABLE) already achieves the same
  objective, so there is no need to add a second mechanism for the
  same purpose.
- Attaching st_ops to mm_struct is likely to run into the same issues
  that cgroup-bpf once faced, e.g. lifetime of cgroup vs bpf, dying
  cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
  the implementation.
- The test cases are too simplistic, lacking eBPF cases that resemble
  real workloads, as sched_ext provides.

If I missed something, please let me know. Thanks!

Solution
========

This series addresses the problems mentioned above.

1. Use cgroup-bpf to customize the mTHP size for different scenarios.
2. Use a cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
   under the same parent cgroup adopt the same eBPF program. Different
   eBPF programs can only be attached to sibling cgroups whose parent
   cgroup has no eBPF program attached, so the hierarchy property of
   the cgroup is not broken.
3. Automatically select different mTHP sizes for different cgroups, so
   that mTHP becomes truly transparent.
4. Design the mthp_ext sample to address real workload issues.
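
To illustrate the hierarchy rule in item 2, here is a minimal user-space
sketch. It is not code from this series: struct cg_node, effective_prog()
and attach_allowed() are hypothetical names that only model "a sub-cgroup
inherits the program of its nearest ancestor, and a different program may
only be attached where no ancestor already has one".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cgroup tree; NOT the kernel's struct cgroup. */
struct cg_node {
	const struct cg_node *parent;	/* NULL for the root cgroup */
	int prog_id;			/* 0 means no eBPF program attached here */
};

/* Program governing @cg: its own, else the nearest ancestor's; 0 if none. */
static int effective_prog(const struct cg_node *cg)
{
	for (; cg != NULL; cg = cg->parent)
		if (cg->prog_id != 0)
			return cg->prog_id;
	return 0;
}

/* Attaching to @cg keeps the hierarchy intact only if no ancestor has a program. */
static bool attach_allowed(const struct cg_node *cg)
{
	assert(cg != NULL);
	return effective_prog(cg->parent) == 0;
}
```

With a root that has no program and two siblings carrying programs 1 and
2, a child of the first sibling resolves to program 1 and may not attach
a program of its own.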

The main functions of the mthp_ext are as follows:

- When a sub-cgroup is under high memory pressure (by default, a PSI
  "full" stall of 100ms within a 1s window), it automatically falls
  back to using 4KB.
- When the anon+shmem memory usage of a sub-cgroup falls below the
  minimum memory (default 16MB), small-memory processes automatically
  fall back to using 4KB.
- Under normal conditions, when there is no memory pressure and the
  anon+shmem memory usage exceeds the minimum memory, all mTHP sizes
  shall be utilized by kernel.
- Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with
  support for specifying any cgroup directory.
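
The decision logic above can be sketched in plain C as follows. This is
only an illustration under stated assumptions, not the code in
samples/bpf/mthp_ext.bpf.c: struct cg_sample and mthp_pick_orders() are
hypothetical names, the "full 100ms 1s" default is read as 100ms of full
stall within a 1s window, and returning 0 is assumed to mean "no large
orders, use 4KB pages".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_STALL_THRESHOLD_US	100000ULL	/* 100ms, default from the cover letter */
#define DEMO_MIN_MEM_BYTES	(16ULL << 20)	/* 16MB, default from the cover letter */

/* Hypothetical per-cgroup snapshot taken once per 1s window. */
struct cg_sample {
	uint64_t full_stall_us;		/* PSI "full" stall time in the window */
	uint64_t anon_shmem_bytes;	/* anon + shmem usage of the cgroup */
};

/*
 * @orders is a bitmask of allowed orders as offered by the kernel.
 * Returns 0 (assumed to mean 4KB only) or the unmodified mask.
 */
static unsigned long mthp_pick_orders(const struct cg_sample *s,
				      unsigned long orders)
{
	assert(s != NULL);
	if (s->full_stall_us >= DEMO_STALL_THRESHOLD_US)
		return 0;	/* high memory pressure */
	if (s->anon_shmem_bytes < DEMO_MIN_MEM_BYTES)
		return 0;	/* small-memory cgroup */
	return orders;		/* normal conditions: all mTHP sizes */
}
```

Keeping the function a pure filter over the incoming mask means it can
only remove orders, never add ones the kernel did not offer.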

Performance
===========

Below are some performance test results, measured on an x86_64 machine
(AMD Ryzen9 9950X 16C32T, 32G memory, 8G zram).

NOTE: The always/never labels below indicate setting all mTHP sizes
to always/never. For the detailed test scripts, see [4].

redis results
~~~~~~~~~~~~~

command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set

When cgroup memory.high=max.

| redis-noBGSAVE | always      | never                | always+mthp_ext      |
|----------------|-------------|----------------------|----------------------|
| rps            | 1410824.167 | 1210387.500 (-14.2%) | 1265659.833 (-10.3%) |
| avg_latency_ms | 0.220       | 0.259       (-17.7%) | 0.247       (-12.3%) |
| p95_latency_ms | 0.618       | 0.708       (-14.6%) | 0.676       (-9.40%) |
| p99_latency_ms | 0.687       | 0.818       (-19.1%) | 0.756       (-10.0%) |

| redis-BGSAVE   | always      | never                | always+mthp_ext      |
|----------------|-------------|----------------------|----------------------|
| rps            | 1418032.127 | 1212306.873 (-14.5%) | 1261069.373 (-11.1%) |
| avg_latency_ms | 0.218       | 0.259       (-18.8%) | 0.248       (-13.8%) |
| p95_latency_ms | 0.620       | 0.714       (-15.2%) | 0.687       (-10.8%) |
| p99_latency_ms | 0.684       | 0.828       (-21.1%) | 0.756       (-10.5%) |

When cgroup memory.high=2G.

| redis-noBGSAVE | always    | never                 | always+mthp_ext       |
|----------------|-----------|-----------------------|-----------------------|
| rps            | 24813.980 | 1049254.583 (4128.5%) | 1063171.270 (4184.6%) |
| avg_latency_ms | 13.317    | 0.302       (  97.7%) | 0.298       (  97.8%) |
| p95_latency_ms | 23.220    | 0.754       (  96.8%) | 0.828       (  96.4%) |
| p99_latency_ms | 369.492   | 1.154       (  99.7%) | 1.615       (  99.6%) |

| redis-BGSAVE   | always    | never                 | always+mthp_ext       |
|----------------|-----------|-----------------------|-----------------------|
| rps            | 48373.433 | 1058403.500 (2088.0%) | 1070805.707 (2113.6%) |
| avg_latency_ms | 6.884     | 0.300       (  95.6%) | 0.296       (  95.7%) |
| p95_latency_ms | 16.474    | 0.743       (  95.5%) | 0.820       (  95.0%) |
| p99_latency_ms | 326.058   | 1.170       (  99.6%) | 1.586       (  99.5%) |

When Redis is under no memory pressure, RPS drops by 10.3% with mthp_ext
compared to always (from about 1.41M to 1.27M). Is this within the
acceptable range?

However, under high memory pressure, RPS improves by 4184.6% (from about
24.8K to 1.06M), and p99 tail latency is reduced by more than 99%.
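
For reference, the percentages in the redis tables can be reproduced with
the two helpers below. The helper names are mine, and the sign convention
(positive means better than always) is inferred from the tables rather
than stated anywhere.

```c
#include <assert.h>

/* Relative change of a higher-is-better metric (e.g. rps), in percent. */
static double pct_gain(double base, double val)
{
	assert(base != 0.0);
	return (val - base) / base * 100.0;
}

/* Relative reduction of a lower-is-better metric (e.g. latency), in percent. */
static double pct_reduction(double base, double val)
{
	assert(base != 0.0);
	return (base - val) / base * 100.0;
}
```

For example, pct_gain(24813.980, 1063171.270) gives roughly 4184.6 and
pct_reduction(369.492, 1.615) gives roughly 99.6, matching the
redis-noBGSAVE rows for memory.high=2G.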

unixbench results
~~~~~~~~~~~~~~~~~

command: ./Run -c 1 shell8

| unixbench shell8 | always  |      never      | always+mthp_ext |
|------------------|---------|-----------------|-----------------|
| Score            | 23019.4 | 24378.3 (5.90%) | 24314.5 (5.63%) |

With mthp_ext, the score improves by 5.63% over always.

kernbench results
~~~~~~~~~~~~~~~~~

When cgroup memory.high=max, mthp_ext shows no regression.

                            always                 never               always+mthp_ext
Amean     user-32    19666.44 (   0.00%)    18464.56 *   6.11%*    19650.13 *   0.08%*
Amean     syst-32     1169.16 (   0.00%)     2235.17 * -91.18%*     1169.42 (  -0.02%)
Amean     elsp-32      702.51 (   0.00%)      699.90 *   0.37%*      702.15 (   0.05%)
BAmean-95 user-32    19665.93 (   0.00%)    18461.86 (   6.12%)    19647.61 (   0.09%)
BAmean-95 syst-32     1168.68 (   0.00%)     2234.27 ( -91.18%)     1169.20 (  -0.04%)
BAmean-95 elsp-32      702.34 (   0.00%)      699.80 (   0.36%)      702.04 (   0.04%)
BAmean-99 user-32    19665.93 (   0.00%)    18461.86 (   6.12%)    19647.61 (   0.09%)
BAmean-99 syst-32     1168.68 (   0.00%)     2234.27 ( -91.18%)     1169.20 (  -0.04%)
BAmean-99 elsp-32      702.34 (   0.00%)      699.80 (   0.36%)      702.04 (   0.04%)

When cgroup memory.high=2G, mthp_ext reduces system time (syst-32) by 20.98%.

                            always                 never               always+mthp_ext
Amean     user-32    20459.89 (   0.00%)    18517.24 *   9.49%*    19963.73 *   2.43%*
Amean     syst-32    11890.63 (   0.00%)     6681.95 *  43.80%*     9395.94 *  20.98%*
Amean     elsp-32     1305.29 (   0.00%)      928.13 *  28.89%*     1109.37 *  15.01%*
BAmean-95 user-32    20439.38 (   0.00%)    18510.65 (   9.44%)    19957.89 (   2.36%)
BAmean-95 syst-32    11789.99 (   0.00%)     6679.03 (  43.35%)     9381.77 (  20.43%)
BAmean-95 elsp-32     1302.18 (   0.00%)      927.89 (  28.74%)     1108.65 (  14.86%)
BAmean-99 user-32    20439.38 (   0.00%)    18510.65 (   9.44%)    19957.89 (   2.36%)
BAmean-99 syst-32    11789.99 (   0.00%)     6679.03 (  43.35%)     9381.77 (  20.43%)
BAmean-99 elsp-32     1302.18 (   0.00%)      927.89 (  28.74%)     1108.65 (  14.86%)

TODO
====

- Do not destroy the cgroup hierarchy property. If an eBPF program
  already exists in the sub-cgroup, trigger an error and clear the
  already set bpf_mthp_ops data.
- mthp_ext handles different "enum tva_type" values. For example, for
  small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
  TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continue to collapse to all mTHP
  sizes. Under high memory pressure, only 4KB is used for
  TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
  collapse to all mTHP sizes.
- Add selftests.
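
The per-tva_type behaviour planned in the second TODO item could look
roughly like the sketch below. It is not implemented in this series; the
DEMO_TVA_* enum is a local stand-in for the kernel's enum tva_type,
demo_orders() is a hypothetical name, and 0 is again assumed to mean
"4KB only".

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel's enum tva_type (values are illustrative). */
enum demo_tva_type {
	DEMO_TVA_PAGEFAULT,
	DEMO_TVA_KHUGEPAGED,
	DEMO_TVA_FORCED_COLLAPSE,
};

static unsigned long demo_orders(enum demo_tva_type type, bool small_mem,
				 bool pressure, unsigned long orders)
{
	assert(type >= DEMO_TVA_PAGEFAULT && type <= DEMO_TVA_FORCED_COLLAPSE);

	/* Forced collapse always keeps every mTHP size. */
	if (type == DEMO_TVA_FORCED_COLLAPSE)
		return orders;
	/* Under pressure, both page faults and khugepaged use 4KB only. */
	if (pressure)
		return 0;
	/* Small-memory processes use 4KB on page fault only. */
	if (small_mem && type == DEMO_TVA_PAGEFAULT)
		return 0;
	return orders;
}
```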

If there are additional scenarios, please let me know as well, so I can
conduct further prototype verification tests to make mTHP more
transparent.

If any of the above strategies can be integrated into the kernel,
please let me know. I would be delighted to incorporate them.

This series is based on linux v7.1-rc1 (26fd6bff2c05) plus the first
four patches of "mm: BPF OOM"[3].

Thank you very much for your comments and discussions.

[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
[2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
[3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
[4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext

Vernon Yang (4):
  psi: add psi_group_flush_stats() function
  bpf: add bpf_cgroup_{flush_stats,stall} function
  mm: introduce bpf_mthp_ops struct ops
  samples: bpf: add mthp_ext

 MAINTAINERS                     |   3 +
 include/linux/bpf_huge_memory.h |  35 ++++
 include/linux/cgroup-defs.h     |   1 +
 include/linux/huge_mm.h         |   6 +
 include/linux/psi.h             |   1 +
 kernel/bpf/helpers.c            |  29 +++
 kernel/sched/psi.c              |  34 +++-
 mm/Kconfig                      |  14 ++
 mm/Makefile                     |   1 +
 mm/bpf_huge_memory.c            | 169 ++++++++++++++++
 samples/bpf/.gitignore          |   1 +
 samples/bpf/Makefile            |   7 +-
 samples/bpf/mthp_ext.bpf.c      | 142 +++++++++++++
 samples/bpf/mthp_ext.c          | 340 ++++++++++++++++++++++++++++++++
 samples/bpf/mthp_ext.h          |  30 +++
 15 files changed, 804 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/bpf_huge_memory.h
 create mode 100644 mm/bpf_huge_memory.c
 create mode 100644 samples/bpf/mthp_ext.bpf.c
 create mode 100644 samples/bpf/mthp_ext.c
 create mode 100644 samples/bpf/mthp_ext.h

--
2.53.0

Re: [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
Posted by Yafang Shao 1 month, 1 week ago
On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Hi all,
>
> Background
> ==========
>
> As is well known, a system can simultaneously run multiple different
> scenarios. However, THP is not beneficial in every scenario — it is only
> most suitable for memory-intensive applications that are not sensitive
> to tail latency. For example, Redis, which is sensitive to tail latency,
> is not suitable for THP. But in practice, due to Redis issues, the
> entire THP functionality is often turned off, preventing other scenarios
> from benefiting from it.
>
> There are also some embedded scenarios (e.g. Android) that directly use
> 2MB THP, where the granularity is too large. Therefore, we introduced
> mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> still globally fix a single mTHP size and are unable to automatically
> select different mTHP sizes based on different scenarios.
>
> After testing, it was found that
>
> - When the system has a lot of free memory, it is normal for Redis to
>   use mTHP. performance degradation in Redis only occurs when the system
>   is under high memory pressure.
> - Additionally, when a large number of small-memory processes use mTHP,
>   memory waste is prone to occur, and performance degradation may also
>   happen during fast memory allocation/release.
>
> Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> following issues.
>
> - It breaks the cgroup hierarchy property.
> - Add new THP knobs, making sysadmin's job more complex
>
> Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> following issues.
>
> - It didn't address the issue on the per-process mode.
> - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
>   the same objective, there is no need to add two mechanisms for the
>   same purpose.
> - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once

Hello,

The primary hurdles preventing BPF-THP from being upstreamed are:

- Uncertainty regarding the stable API

  It remains unclear whether the following API is sufficient for
  BPF-THP requirements:

  unsigned long
  bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
                          unsigned long orders)
  {
      return orders;
  }

- Ongoing integration of cgroups and struct-ops

  Work is still in progress to integrate cgroup support with
  struct_ops (see also
  https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
  We should wait for this infrastructure to land before introducing
  new cgroup-based struct_ops.

>   faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
>   cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
>   implementation.
> - The test cases are too simplistic, lacking eBPF cases similar to real
>   workloads such as sched_ext.
>
> If I miss some thing, please let me know. Thanks!

BTW, it would be better to include the original authors in the CC
list, especially since their work is cited in your commit message. ;)

--
Regards

Yafang
Re: [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
Posted by Vernon Yang 1 month, 1 week ago
On Thu, May 7, 2026 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> >
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > Hi all,
> >
> > Background
> > ==========
> >
> > As is well known, a system can simultaneously run multiple different
> > scenarios. However, THP is not beneficial in every scenario — it is only
> > most suitable for memory-intensive applications that are not sensitive
> > to tail latency. For example, Redis, which is sensitive to tail latency,
> > is not suitable for THP. But in practice, due to Redis issues, the
> > entire THP functionality is often turned off, preventing other scenarios
> > from benefiting from it.
> >
> > There are also some embedded scenarios (e.g. Android) that directly use
> > 2MB THP, where the granularity is too large. Therefore, we introduced
> > mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> > still globally fix a single mTHP size and are unable to automatically
> > select different mTHP sizes based on different scenarios.
> >
> > After testing, it was found that
> >
> > - When the system has a lot of free memory, it is normal for Redis to
> >   use mTHP. performance degradation in Redis only occurs when the system
> >   is under high memory pressure.
> > - Additionally, when a large number of small-memory processes use mTHP,
> >   memory waste is prone to occur, and performance degradation may also
> >   happen during fast memory allocation/release.
> >
> > Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> > following issues.
> >
> > - It breaks the cgroup hierarchy property.
> > - Add new THP knobs, making sysadmin's job more complex
> >
> > Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> > following issues.
> >
> > - It didn't address the issue on the per-process mode.
> > - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> >   the same objective, there is no need to add two mechanisms for the
> >   same purpose.
> > - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
>
> Hello,
>
> The primary hurdles preventing BPF-THP from being upstreamed are:
>
> - Uncertainty regarding the stable API
>
>   It remains unclear whether the following API is sufficient for
>   BPF-THP requirements:

Thank you for pointing this out. I will add it to the issue list in
the next version.

At the same time, I encourage everyone to share relevant real workload
scenarios. I will analyze them and extend mthp_ext to address them, so
that we can determine which ABI satisfies the BPF-THP requirements.

>   unsigned long
>   bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
>                                             unsigned long orders)
>   {
>       return orders;
>   }
>
> - Ongoing integration of cgroups and struct-ops
>
>   Work is still in progress to integrate cgroup support with
> struct_ops (see also
> https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
> We should wait for this infrastructure to land before introducing
> new cgroup-based struct_ops.

This work can be carried out in parallel with integrating cgroup
support with struct_ops. I will focus on addressing real workload
scenarios to further clarify and stabilize the BPF-THP ABI.

> >   faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> >   cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> >   implementation.
> > - The test cases are too simplistic, lacking eBPF cases similar to real
> >   workloads such as sched_ext.
> >
> > If I miss some thing, please let me know. Thanks!
>
> BTW, it would be better to include the original authors in the CC
> list, especially since their work is cited in your commit message. ;)

OK, I will CC the relevant authors in the next version.

--
Cheers,
Vernon
Re: [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
Posted by Yafang Shao 1 month, 1 week ago
On Thu, May 7, 2026 at 8:51 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> On Thu, May 7, 2026 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> > >
> > > From: Vernon Yang <yanglincheng@kylinos.cn>
> > >
> > > Hi all,
> > >
> > > Background
> > > ==========
> > >
> > > As is well known, a system can simultaneously run multiple different
> > > scenarios. However, THP is not beneficial in every scenario — it is only
> > > most suitable for memory-intensive applications that are not sensitive
> > > to tail latency. For example, Redis, which is sensitive to tail latency,
> > > is not suitable for THP. But in practice, due to Redis issues, the
> > > entire THP functionality is often turned off, preventing other scenarios
> > > from benefiting from it.
> > >
> > > There are also some embedded scenarios (e.g. Android) that directly use
> > > 2MB THP, where the granularity is too large. Therefore, we introduced
> > > mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> > > still globally fix a single mTHP size and are unable to automatically
> > > select different mTHP sizes based on different scenarios.
> > >
> > > After testing, it was found that
> > >
> > > - When the system has a lot of free memory, it is normal for Redis to
> > >   use mTHP. performance degradation in Redis only occurs when the system
> > >   is under high memory pressure.
> > > - Additionally, when a large number of small-memory processes use mTHP,
> > >   memory waste is prone to occur, and performance degradation may also
> > >   happen during fast memory allocation/release.
> > >
> > > Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> > > following issues.
> > >
> > > - It breaks the cgroup hierarchy property.
> > > - Add new THP knobs, making sysadmin's job more complex
> > >
> > > Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> > > following issues.
> > >
> > > - It didn't address the issue on the per-process mode.
> > > - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> > >   the same objective, there is no need to add two mechanisms for the
> > >   same purpose.
> > > - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
> >
> > Hello,
> >
> > The primary hurdles preventing BPF-THP from being upstreamed are:
> >
> > - Uncertainty regarding the stable API
> >
> >   It remains unclear whether the following API is sufficient for
> >   BPF-THP requirements:
>
> Thank you for pointing this out. I will add it to this issue list in
> the next version.
>
> At the same time, I also encourage everyone to actively provide
> relevant real workload scenarios. I will conduct related analyses and
> develop the mthp_ext to address them, so we can determine which ABI
> can satisfy the BPF-THP requirements.
>
> >   unsigned long
> >   bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
> >                                             unsigned long orders)
> >   {
> >       return orders;
> >   }
> >
> > - Ongoing integration of cgroups and struct-ops
> >
> >   Work is still in progress to integrate cgroup support with
> > struct_ops (see also
> > https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
> > We should wait for this infrastructure to land before introducing
> > new cgroup-based struct_ops.
>
> This work can be carried out in parallel with integrating cgroup
> support with struct_ops. I will focus on addressing real workload
> scenarios to further clear/stabilize the BPF-THP ABI.
>
> > >   faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> > >   cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> > >   implementation.
> > > - The test cases are too simplistic, lacking eBPF cases similar to real
> > >   workloads such as sched_ext.
> > >
> > > If I miss some thing, please let me know. Thanks!
> >
> > BTW, it would be better to include the original authors in the CC
> > list, especially since their work is cited in your commit message. ;)
>
> OK, I will CC the relevant authors in the next version.

Hello Vernon,

I believe it would be best to hold off until David provides guidance
on the future direction. While I am not currently active on BPF-THP,
we are still looking for the right opportunity to upstream it. The
primary difference is that your implementation is cgroup-based;
however, we also plan to switch to that approach once Roman’s work
lands.

I don't mean to imply that BPF-THP is solely my project, but I suspect
you will eventually arrive at a similar implementation to what I’ve
developed. So far, I haven’t found a more efficient API than the
following:

  unsigned long
  bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
                          unsigned long orders)
  {
      return orders;
  }

-- 
Regards
Yafang
Re: [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
Posted by Vernon Yang 1 month, 1 week ago
On Thu, May 7, 2026 at 9:19 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, May 7, 2026 at 8:51 PM Vernon Yang <vernon2gm@gmail.com> wrote:
> >
> > On Thu, May 7, 2026 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> > > >
> > > > From: Vernon Yang <yanglincheng@kylinos.cn>
> > > >
> > > > Hi all,
> > > >
> > > > Background
> > > > ==========
> > > >
> > > > As is well known, a system can simultaneously run multiple different
> > > > scenarios. However, THP is not beneficial in every scenario — it is only
> > > > most suitable for memory-intensive applications that are not sensitive
> > > > to tail latency. For example, Redis, which is sensitive to tail latency,
> > > > is not suitable for THP. But in practice, due to Redis issues, the
> > > > entire THP functionality is often turned off, preventing other scenarios
> > > > from benefiting from it.
> > > >
> > > > There are also some embedded scenarios (e.g. Android) that directly use
> > > > 2MB THP, where the granularity is too large. Therefore, we introduced
> > > > mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> > > > still globally fix a single mTHP size and are unable to automatically
> > > > select different mTHP sizes based on different scenarios.
> > > >
> > > > After testing, it was found that
> > > >
> > > > - When the system has a lot of free memory, it is normal for Redis to
> > > >   use mTHP. performance degradation in Redis only occurs when the system
> > > >   is under high memory pressure.
> > > > - Additionally, when a large number of small-memory processes use mTHP,
> > > >   memory waste is prone to occur, and performance degradation may also
> > > >   happen during fast memory allocation/release.
> > > >
> > > > Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> > > > following issues.
> > > >
> > > > - It breaks the cgroup hierarchy property.
> > > > - Add new THP knobs, making sysadmin's job more complex
> > > >
> > > > Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> > > > following issues.
> > > >
> > > > - It didn't address the issue on the per-process mode.
> > > > - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> > > >   the same objective, there is no need to add two mechanisms for the
> > > >   same purpose.
> > > > - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
> > >
> > > Hello,
> > >
> > > The primary hurdles preventing BPF-THP from being upstreamed are:
> > >
> > > - Uncertainty regarding the stable API
> > >
> > >   It remains unclear whether the following API is sufficient for
> > >   BPF-THP requirements:
> >
> > Thank you for pointing this out. I will add it to this issue list in
> > the next version.
> >
> > At the same time, I also encourage everyone to actively provide
> > relevant real workload scenarios. I will conduct related analyses and
> > develop the mthp_ext to address them, so we can determine which ABI
> > can satisfy the BPF-THP requirements.
> >
> > >   unsigned long
> > >   bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
> > >                                             unsigned long orders)
> > >   {
> > >       return orders;
> > >   }
> > >
> > > - Ongoing integration of cgroups and struct-ops
> > >
> > >   Work is still in progress to integrate cgroup support with
> > > struct_ops (see also
> > > https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
> > > We should wait for this infrastructure to land before introducing
> > > new cgroup-based struct_ops.
> >
> > This work can be carried out in parallel with integrating cgroup
> > support with struct_ops. I will focus on addressing real workload
> > scenarios to further clear/stabilize the BPF-THP ABI.
> >
> > > >   faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> > > >   cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> > > >   implementation.
> > > > - The test cases are too simplistic, lacking eBPF cases similar to real
> > > >   workloads such as sched_ext.
> > > >
> > > > If I miss some thing, please let me know. Thanks!
> > >
> > > BTW, it would be better to include the original authors in the CC
> > > list, especially since their work is cited in your commit message. ;)
> >
> > OK, I will CC the relevant authors in the next version.
>
> Hello Vernon,
>
> I believe it would be best to hold off until David provides guidance
> on the future direction. While I am not currently active on BPF-THP,
> we are still looking for the right opportunity to upstream it. The
> primary difference is that your implementation is cgroup-based;
> however, we also plan to switch to that approach once Roman’s work
> lands.

LSF/MM/BPF has just concluded, so I think David will soon share some new
directions with us.

If it is merely a matter of converting the per-process version to a
cgroup-based implementation (I have already completed development based
on cgroup-bpf), there is no need for you to start from scratch. You can
directly test whether mthp_ext addresses your real workload scenarios.

Based on the current patchset, we can collaborate on development, as
our shared goal is to get this done right.

--
Cheers,
Vernon