From: Vernon Yang <yanglincheng@kylinos.cn>
Hi all,
Background
==========
As is well known, a system can simultaneously run multiple different
scenarios. However, THP is not beneficial in every scenario — it is only
most suitable for memory-intensive applications that are not sensitive
to tail latency. For example, Redis, which is sensitive to tail latency,
is not suitable for THP. But in practice, due to Redis issues, the
entire THP functionality is often turned off, preventing other scenarios
from benefiting from it.
There are also some embedded scenarios (e.g. Android) that directly use
2MB THP, where the granularity is too large. Therefore, we introduced
mTHP in v6.8, which supports multiple-size THP. In practice, however, we
still globally fix a single mTHP size and are unable to automatically
select different mTHP sizes based on different scenarios.
After testing, it was found that
- When the system has a lot of free memory, Redis runs fine with mTHP.
Performance degradation in Redis only occurs when the system is under
high memory pressure.
- Additionally, when a large number of small-memory processes use mTHP,
memory waste is prone to occur, and performance degradation may also
happen during fast memory allocation/release.
Previously, "Cgroup-based THP control"[1] was proposed, but it had the
following issues.
- It breaks the cgroup hierarchy property.
- It adds new THP knobs, making the sysadmin's job more complex.
Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
following issues.
- It did not address the issues with the per-process mode.
- For global mode, prctl(PR_SET_THP_DISABLE) already achieves the same
objective, so there is no need to add two mechanisms for the same
purpose.
- It attaches st_ops to mm_struct, so the same issues that cgroup-bpf
once faced are likely to arise again, e.g. lifetime of cgroup vs bpf,
dying cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
the implementation.
- The test cases are too simplistic, lacking eBPF cases similar to real
workloads such as sched_ext.
If I missed something, please let me know. Thanks!
Solution
========
This series will solve all the problems mentioned above.
1. Use cgroup-bpf to customize the mTHP size for different scenarios.
2. Use a cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
under the same parent-cgroup adopt the same eBPF program. Attaching
different eBPF programs is only supported for sibling-cgroups whose
parent-cgroup has no eBPF program attached, so the hierarchy property
of the cgroup is not broken.
3. Automatically select different mTHP sizes for different cgroups, so
that THP becomes truly transparent.
4. Design the mthp_ext case to address real workload issues.
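The attachment rule in item 2 can be pictured with a small userspace
model. This is only a sketch under stated assumptions, not code from the
series: struct cg_sketch, struct ops_sketch, effective_ops() and
attach_ops() are hypothetical stand-ins for struct cgroup and
bpf_mthp_ops, and the sketch only checks ancestors, not descendants.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the series itself uses struct cgroup and bpf_mthp_ops. */
struct ops_sketch {
	const char *name;
};

struct cg_sketch {
	struct cg_sketch *parent;	/* NULL for the root cgroup */
	struct ops_sketch *ops;		/* program attached directly here, if any */
};

/* Return the nearest program on the path from cg up to the root, or NULL. */
static struct ops_sketch *effective_ops(const struct cg_sketch *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->ops)
			return cg->ops;
	return NULL;
}

/*
 * Attach ops to cg only if neither cg nor any ancestor already has a
 * program. Returns 0 on success, -1 if the attach is refused.
 */
static int attach_ops(struct cg_sketch *cg, struct ops_sketch *ops)
{
	assert(cg != NULL && ops != NULL);
	if (effective_ops(cg))
		return -1;
	cg->ops = ops;
	return 0;
}
```

In this model two sibling cgroups can carry different programs, while a
child of an attached cgroup inherits its parent's program and cannot
attach its own.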
The main functions of the mthp_ext are as follows:
- When a sub-cgroup is under high memory pressure (default PSI trigger:
"full", 100ms of stall within a 1s window), it automatically falls back
to 4KB pages.
- When the anon+shmem memory usage of a sub-cgroup falls below the
minimum memory (default 16MB), its small-memory processes automatically
fall back to 4KB pages.
- Under normal conditions, when there is no memory pressure and the
anon+shmem memory usage exceeds the minimum memory, the kernel may use
all mTHP sizes.
- The root-cgroup (/sys/fs/cgroup) directory is monitored by default,
and any other cgroup directory can be specified instead.
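The three policy rules above can be sketched as a small pure function.
This is only an illustration under stated assumptions, not the code in
samples/bpf/mthp_ext.bpf.c: mthp_pick_orders(), struct cgroup_sample and
MTHP_MIN_BYTES_DEFAULT are hypothetical names, the under_pressure flag
stands in for the PSI check, and treating usage exactly equal to the
minimum as "large enough" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; none of these identifiers come from the patches. */
#define MTHP_MIN_BYTES_DEFAULT (16ULL << 20)	/* the 16MB default above */

struct cgroup_sample {
	bool under_pressure;		/* PSI "full" stall crossed the trigger */
	uint64_t anon_shmem_bytes;	/* anon + shmem usage of the sub-cgroup */
};

/*
 * Return the subset of candidate orders the policy allows. Bit N set
 * means order-N is allowed; 0 means no large orders, i.e. 4KB pages.
 */
static unsigned long mthp_pick_orders(const struct cgroup_sample *s,
				      uint64_t min_bytes,
				      unsigned long orders)
{
	assert(s != NULL);
	if (s->under_pressure)
		return 0;	/* rule 1: memory pressure, fall back to 4KB */
	if (s->anon_shmem_bytes < min_bytes)
		return 0;	/* rule 2: small-memory cgroup, fall back to 4KB */
	return orders;		/* rule 3: normal case, keep all mTHP sizes */
}
```

Returning the input bitmask unchanged in the normal case keeps the
decision about which sizes are enabled with the existing sysfs settings;
the policy only ever narrows it.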
Performance
===========
Below are some performance test results, measured on an x86_64 machine
(AMD Ryzen9 9950X 16C32T, 32G memory, 8G zram).
NOTE: The always/never labels below mean that all mTHP sizes are set to
always/never. The detailed test scripts are available at [4].
redis results
~~~~~~~~~~~~~
command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set
When cgroup memory.high=max.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1410824.167 | 1210387.500 (-14.2%) | 1265659.833 (-10.3%) |
| avg_latency_ms | 0.220 | 0.259 (-17.7%) | 0.247 (-12.3%) |
| p95_latency_ms | 0.618 | 0.708 (-14.6%) | 0.676 (-9.40%) |
| p99_latency_ms | 0.687 | 0.818 (-19.1%) | 0.756 (-10.0%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1418032.127 | 1212306.873 (-14.5%) | 1261069.373 (-11.1%) |
| avg_latency_ms | 0.218 | 0.259 (-18.8%) | 0.248 (-13.8%) |
| p95_latency_ms | 0.620 | 0.714 (-15.2%) | 0.687 (-10.8%) |
| p99_latency_ms | 0.684 | 0.828 (-21.1%) | 0.756 (-10.5%) |
When cgroup memory.high=2G.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 24813.980 | 1049254.583 (4128.5%) | 1063171.270 (4184.6%) |
| avg_latency_ms | 13.317 | 0.302 ( 97.7%) | 0.298 ( 97.8%) |
| p95_latency_ms | 23.220 | 0.754 ( 96.8%) | 0.828 ( 96.4%) |
| p99_latency_ms | 369.492 | 1.154 ( 99.7%) | 1.615 ( 99.6%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 48373.433 | 1058403.500 (2088.0%) | 1070805.707 (2113.6%) |
| avg_latency_ms | 6.884 | 0.300 ( 95.6%) | 0.296 ( 95.7%) |
| p95_latency_ms | 16.474 | 0.743 ( 95.5%) | 0.820 ( 95.0%) |
| p99_latency_ms | 326.058 | 1.170 ( 99.6%) | 1.586 ( 99.5%) |
When Redis is under no memory pressure, RPS drops by 10.3% with mthp_ext
(from 1.41M to 1.27M; is this within the acceptable range?).
However, under high memory pressure, RPS improves by 4184.6% (from 24.8K
to 1.06M), and p99 tail latency is reduced by more than 99%.
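For reference, the RPS percentages above follow from the usual relative
change formula. The helper below is a minimal sketch for re-checking the
numbers; pct_change() is a hypothetical name and is not part of the test
scripts in [4].

```c
#include <assert.h>

/* Relative change of `after` versus `before`, in percent. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

With the redis-noBGSAVE numbers, pct_change(24813.980, 1063171.270)
gives about 4184.6 and pct_change(1410824.167, 1265659.833) gives about
-10.3. The latency rows appear to use the opposite sign, so that a
positive percentage means lower latency.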
unixbench results
~~~~~~~~~~~~~~~~~
command: ./Run -c 1 shell8
| unixbench shell8 | always | never | always+mthp_ext |
|------------------|---------|-----------------|-----------------|
| Score | 23019.4 | 24378.3 (5.90%) | 24314.5 (5.63%) |
With mthp_ext, the score improves by 5.63% over always.
kernbench results
~~~~~~~~~~~~~~~~~
When cgroup memory.high=max, mthp_ext shows no regression.
                   always              never               always+mthp_ext
Amean     user-32  19666.44 (  0.00%)  18464.56 *  6.11%*  19650.13 *  0.08%*
Amean     syst-32   1169.16 (  0.00%)   2235.17 *-91.18%*   1169.42 ( -0.02%)
Amean     elsp-32    702.51 (  0.00%)    699.90 *  0.37%*    702.15 (  0.05%)
BAmean-95 user-32  19665.93 (  0.00%)  18461.86 (  6.12%)  19647.61 (  0.09%)
BAmean-95 syst-32   1168.68 (  0.00%)   2234.27 (-91.18%)   1169.20 ( -0.04%)
BAmean-95 elsp-32    702.34 (  0.00%)    699.80 (  0.36%)    702.04 (  0.04%)
BAmean-99 user-32  19665.93 (  0.00%)  18461.86 (  6.12%)  19647.61 (  0.09%)
BAmean-99 syst-32   1168.68 (  0.00%)   2234.27 (-91.18%)   1169.20 ( -0.04%)
BAmean-99 elsp-32    702.34 (  0.00%)    699.80 (  0.36%)    702.04 (  0.04%)
When cgroup memory.high=2G, mthp_ext reduces system time (syst-32) by
20.98% compared with always.
                   always              never               always+mthp_ext
Amean     user-32  20459.89 (  0.00%)  18517.24 *  9.49%*  19963.73 *  2.43%*
Amean     syst-32  11890.63 (  0.00%)   6681.95 * 43.80%*   9395.94 * 20.98%*
Amean     elsp-32   1305.29 (  0.00%)    928.13 * 28.89%*   1109.37 * 15.01%*
BAmean-95 user-32  20439.38 (  0.00%)  18510.65 (  9.44%)  19957.89 (  2.36%)
BAmean-95 syst-32  11789.99 (  0.00%)   6679.03 ( 43.35%)   9381.77 ( 20.43%)
BAmean-95 elsp-32   1302.18 (  0.00%)    927.89 ( 28.74%)   1108.65 ( 14.86%)
BAmean-99 user-32  20439.38 (  0.00%)  18510.65 (  9.44%)  19957.89 (  2.36%)
BAmean-99 syst-32  11789.99 (  0.00%)   6679.03 ( 43.35%)   9381.77 ( 20.43%)
BAmean-99 elsp-32   1302.18 (  0.00%)    927.89 ( 28.74%)   1108.65 ( 14.86%)
TODO
====
- Do not destroy the cgroup hierarchy property. If an eBPF program
already exists in the sub-cgroup, trigger an error and clear the
already set bpf_mthp_ops data.
- mthp_ext handles different "enum tva_type" values. For example, for
small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continues to collapse all mthp
size. Under high memory pressure, only 4KB is used for
TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
collapse all mthp size.
- selftest
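The tva_type item in the list above could look roughly like the
following userspace sketch. It is not part of this series: enum
tva_type_sketch is a local stand-in for the kernel's enum tva_type, and
mthp_orders_by_type() and its boolean inputs are hypothetical names
chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel's enum tva_type; values are illustrative. */
enum tva_type_sketch {
	SK_TVA_PAGEFAULT,
	SK_TVA_KHUGEPAGED,
	SK_TVA_FORCED_COLLAPSE,
};

/* Bit N of the result set means order-N is allowed; 0 means 4KB only. */
static unsigned long mthp_orders_by_type(enum tva_type_sketch type,
					 bool small_memory, bool under_pressure,
					 unsigned long orders)
{
	/* Forced collapse keeps every candidate order in both cases. */
	if (type == SK_TVA_FORCED_COLLAPSE)
		return orders;
	assert(type == SK_TVA_PAGEFAULT || type == SK_TVA_KHUGEPAGED);
	/* High pressure: page faults and khugepaged both fall back to 4KB. */
	if (under_pressure)
		return 0;
	/* Small-memory cgroup: only page faults fall back to 4KB. */
	if (small_memory && type == SK_TVA_PAGEFAULT)
		return 0;
	return orders;
}
```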
If there are additional scenarios, please let me know as well, so I can
conduct further prototype verification tests to make mTHP more
transparent.
If any of the above strategies can be integrated into the kernel,
please let me know. I would be delighted to incorporate them.
This series is based on linux v7.1-rc1 (26fd6bff2c05) plus the first
four patches of "mm: BPF OOM"[3].
Thank you very much for your comments and discussions.
[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
[2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
[3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
[4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext
Vernon Yang (4):
psi: add psi_group_flush_stats() function
bpf: add bpf_cgroup_{flush_stats,stall} function
mm: introduce bpf_mthp_ops struct ops
samples: bpf: add mthp_ext
MAINTAINERS | 3 +
include/linux/bpf_huge_memory.h | 35 ++++
include/linux/cgroup-defs.h | 1 +
include/linux/huge_mm.h | 6 +
include/linux/psi.h | 1 +
kernel/bpf/helpers.c | 29 +++
kernel/sched/psi.c | 34 +++-
mm/Kconfig | 14 ++
mm/Makefile | 1 +
mm/bpf_huge_memory.c | 169 ++++++++++++++++
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 7 +-
samples/bpf/mthp_ext.bpf.c | 142 +++++++++++++
samples/bpf/mthp_ext.c | 340 ++++++++++++++++++++++++++++++++
samples/bpf/mthp_ext.h | 30 +++
15 files changed, 804 insertions(+), 9 deletions(-)
create mode 100644 include/linux/bpf_huge_memory.h
create mode 100644 mm/bpf_huge_memory.c
create mode 100644 samples/bpf/mthp_ext.bpf.c
create mode 100644 samples/bpf/mthp_ext.c
create mode 100644 samples/bpf/mthp_ext.h
--
2.53.0
On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Hi all,
>
> Background
> ==========
>
> As is well known, a system can simultaneously run multiple different
> scenarios. However, THP is not beneficial in every scenario — it is only
> most suitable for memory-intensive applications that are not sensitive
> to tail latency. For example, Redis, which is sensitive to tail latency,
> is not suitable for THP. But in practice, due to Redis issues, the
> entire THP functionality is often turned off, preventing other scenarios
> from benefiting from it.
>
> There are also some embedded scenarios (e.g. Android) that directly use
> 2MB THP, where the granularity is too large. Therefore, we introduced
> mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> still globally fix a single mTHP size and are unable to automatically
> select different mTHP sizes based on different scenarios.
>
> After testing, it was found that
>
> - When the system has a lot of free memory, it is normal for Redis to
> use mTHP. performance degradation in Redis only occurs when the system
> is under high memory pressure.
> - Additionally, when a large number of small-memory processes use mTHP,
> memory waste is prone to occur, and performance degradation may also
> happen during fast memory allocation/release.
>
> Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> following issues.
>
> - It breaks the cgroup hierarchy property.
> - Add new THP knobs, making sysadmin's job more complex
>
> Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> following issues.
>
> - It didn't address the issue on the per-process mode.
> - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> the same objective, there is no need to add two mechanisms for the
> same purpose.
> - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
Hello,
The primary hurdles preventing BPF-THP from being upstreamed are:
- Uncertainty regarding the stable API
It remains unclear whether the following API is sufficient for
BPF-THP requirements:
unsigned long
bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
unsigned long orders)
{
return orders;
}
- Ongoing integration of cgroups and struct-ops
Work is still in progress to integrate cgroup support with
struct_ops (see also
https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
We should wait for this infrastructure to land before introducing
new cgroup-based struct_ops.
> faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> implementation.
> - The test cases are too simplistic, lacking eBPF cases similar to real
> workloads such as sched_ext.
>
> If I miss some thing, please let me know. Thanks!
BTW, it would be better to include the original authors in the CC
list, especially since their work is cited in your commit message. ;)
--
Regards
Yafang
On Thu, May 7, 2026 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> >
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > Hi all,
> >
> > Background
> > ==========
> >
> > As is well known, a system can simultaneously run multiple different
> > scenarios. However, THP is not beneficial in every scenario — it is only
> > most suitable for memory-intensive applications that are not sensitive
> > to tail latency. For example, Redis, which is sensitive to tail latency,
> > is not suitable for THP. But in practice, due to Redis issues, the
> > entire THP functionality is often turned off, preventing other scenarios
> > from benefiting from it.
> >
> > There are also some embedded scenarios (e.g. Android) that directly use
> > 2MB THP, where the granularity is too large. Therefore, we introduced
> > mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> > still globally fix a single mTHP size and are unable to automatically
> > select different mTHP sizes based on different scenarios.
> >
> > After testing, it was found that
> >
> > - When the system has a lot of free memory, it is normal for Redis to
> > use mTHP. performance degradation in Redis only occurs when the system
> > is under high memory pressure.
> > - Additionally, when a large number of small-memory processes use mTHP,
> > memory waste is prone to occur, and performance degradation may also
> > happen during fast memory allocation/release.
> >
> > Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> > following issues.
> >
> > - It breaks the cgroup hierarchy property.
> > - Add new THP knobs, making sysadmin's job more complex
> >
> > Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> > following issues.
> >
> > - It didn't address the issue on the per-process mode.
> > - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> > the same objective, there is no need to add two mechanisms for the
> > same purpose.
> > - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
>
> Hello,
>
> The primary hurdles preventing BPF-THP from being upstreamed are:
>
> - Uncertainty regarding the stable API
>
> It remains unclear whether the following API is sufficient for
> BPF-THP requirements:
Thank you for pointing this out. I will add it to this issue list in
the next version.
At the same time, I also encourage everyone to actively provide
relevant real workload scenarios. I will conduct related analyses and
develop the mthp_ext to address them, so we can determine which ABI
can satisfy the BPF-THP requirements.
> unsigned long
> bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
> unsigned long orders)
> {
> return orders;
> }
>
> - Ongoing integration of cgroups and struct-ops
>
> Work is still in progress to integrate cgroup support with
> struct_ops (see also
> https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
> We should wait for this infrastructure to land before introducing
> new cgroup-based struct_ops.
This work can be carried out in parallel with integrating cgroup
support with struct_ops. I will focus on addressing real workload
scenarios to further clarify and stabilize the BPF-THP ABI.
> > faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> > cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> > implementation.
> > - The test cases are too simplistic, lacking eBPF cases similar to real
> > workloads such as sched_ext.
> >
> > If I miss some thing, please let me know. Thanks!
>
> BTW, it would be better to include the original authors in the CC
> list, especially since their work is cited in your commit message. ;)
OK, I will CC the relevant authors in the next version.
--
Cheers,
Vernon
On Thu, May 7, 2026 at 8:51 PM Vernon Yang <vernon2gm@gmail.com> wrote:
>
> On Thu, May 7, 2026 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> > >
> > > From: Vernon Yang <yanglincheng@kylinos.cn>
> > >
> > > Hi all,
> > >
> > > Background
> > > ==========
> > >
> > > As is well known, a system can simultaneously run multiple different
> > > scenarios. However, THP is not beneficial in every scenario — it is only
> > > most suitable for memory-intensive applications that are not sensitive
> > > to tail latency. For example, Redis, which is sensitive to tail latency,
> > > is not suitable for THP. But in practice, due to Redis issues, the
> > > entire THP functionality is often turned off, preventing other scenarios
> > > from benefiting from it.
> > >
> > > There are also some embedded scenarios (e.g. Android) that directly use
> > > 2MB THP, where the granularity is too large. Therefore, we introduced
> > > mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> > > still globally fix a single mTHP size and are unable to automatically
> > > select different mTHP sizes based on different scenarios.
> > >
> > > After testing, it was found that
> > >
> > > - When the system has a lot of free memory, it is normal for Redis to
> > > use mTHP. performance degradation in Redis only occurs when the system
> > > is under high memory pressure.
> > > - Additionally, when a large number of small-memory processes use mTHP,
> > > memory waste is prone to occur, and performance degradation may also
> > > happen during fast memory allocation/release.
> > >
> > > Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> > > following issues.
> > >
> > > - It breaks the cgroup hierarchy property.
> > > - Add new THP knobs, making sysadmin's job more complex
> > >
> > > Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> > > following issues.
> > >
> > > - It didn't address the issue on the per-process mode.
> > > - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> > > the same objective, there is no need to add two mechanisms for the
> > > same purpose.
> > > - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
> >
> > Hello,
> >
> > The primary hurdles preventing BPF-THP from being upstreamed are:
> >
> > - Uncertainty regarding the stable API
> >
> > It remains unclear whether the following API is sufficient for
> > BPF-THP requirements:
>
> Thank you for pointing this out. I will add it to this issue list in
> the next version.
>
> At the same time, I also encourage everyone to actively provide
> relevant real workload scenarios. I will conduct related analyses and
> develop the mthp_ext to address them, so we can determine which ABI
> can satisfy the BPF-THP requirements.
>
> > unsigned long
> > bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
> > unsigned long orders)
> > {
> > return orders;
> > }
> >
> > - Ongoing integration of cgroups and struct-ops
> >
> > Work is still in progress to integrate cgroup support with
> > struct_ops (see also
> > https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
> > We should wait for this infrastructure to land before introducing
> > new cgroup-based struct_ops.
>
> This work can be carried out in parallel with integrating cgroup
> support with struct_ops. I will focus on addressing real workload
> scenarios to further clear/stabilize the BPF-THP ABI.
>
> > > faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> > > cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> > > implementation.
> > > - The test cases are too simplistic, lacking eBPF cases similar to real
> > > workloads such as sched_ext.
> > >
> > > If I miss some thing, please let me know. Thanks!
> >
> > BTW, it would be better to include the original authors in the CC
> > list, especially since their work is cited in your commit message. ;)
>
> OK, I will CC the relevant authors in the next version.
Hello Vernon,
I believe it would be best to hold off until David provides guidance
on the future direction. While I am not currently active on BPF-THP,
we are still looking for the right opportunity to upstream it. The
primary difference is that your implementation is cgroup-based;
however, we also plan to switch to that approach once Roman’s work
lands.
I don't mean to imply that BPF-THP is solely my project, but I suspect
you will eventually arrive at a similar implementation to what I’ve
developed. So far, I haven’t found a more efficient API than the
following:
unsigned long
bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
unsigned long orders)
{
return orders;
}
--
Regards
Yafang
On Thu, May 7, 2026 at 9:19 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, May 7, 2026 at 8:51 PM Vernon Yang <vernon2gm@gmail.com> wrote:
> >
> > On Thu, May 7, 2026 at 11:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Mon, May 4, 2026 at 12:52 AM Vernon Yang <vernon2gm@gmail.com> wrote:
> > > >
> > > > From: Vernon Yang <yanglincheng@kylinos.cn>
> > > >
> > > > Hi all,
> > > >
> > > > Background
> > > > ==========
> > > >
> > > > As is well known, a system can simultaneously run multiple different
> > > > scenarios. However, THP is not beneficial in every scenario — it is only
> > > > most suitable for memory-intensive applications that are not sensitive
> > > > to tail latency. For example, Redis, which is sensitive to tail latency,
> > > > is not suitable for THP. But in practice, due to Redis issues, the
> > > > entire THP functionality is often turned off, preventing other scenarios
> > > > from benefiting from it.
> > > >
> > > > There are also some embedded scenarios (e.g. Android) that directly use
> > > > 2MB THP, where the granularity is too large. Therefore, we introduced
> > > > mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> > > > still globally fix a single mTHP size and are unable to automatically
> > > > select different mTHP sizes based on different scenarios.
> > > >
> > > > After testing, it was found that
> > > >
> > > > - When the system has a lot of free memory, it is normal for Redis to
> > > > use mTHP. performance degradation in Redis only occurs when the system
> > > > is under high memory pressure.
> > > > - Additionally, when a large number of small-memory processes use mTHP,
> > > > memory waste is prone to occur, and performance degradation may also
> > > > happen during fast memory allocation/release.
> > > >
> > > > Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> > > > following issues.
> > > >
> > > > - It breaks the cgroup hierarchy property.
> > > > - Add new THP knobs, making sysadmin's job more complex
> > > >
> > > > Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> > > > following issues.
> > > >
> > > > - It didn't address the issue on the per-process mode.
> > > > - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> > > > the same objective, there is no need to add two mechanisms for the
> > > > same purpose.
> > > > - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
> > >
> > > Hello,
> > >
> > > The primary hurdles preventing BPF-THP from being upstreamed are:
> > >
> > > - Uncertainty regarding the stable API
> > >
> > > It remains unclear whether the following API is sufficient for
> > > BPF-THP requirements:
> >
> > Thank you for pointing this out. I will add it to this issue list in
> > the next version.
> >
> > At the same time, I also encourage everyone to actively provide
> > relevant real workload scenarios. I will conduct related analyses and
> > develop the mthp_ext to address them, so we can determine which ABI
> > can satisfy the BPF-THP requirements.
> >
> > > unsigned long
> > > bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
> > > unsigned long orders)
> > > {
> > > return orders;
> > > }
> > >
> > > - Ongoing integration of cgroups and struct-ops
> > >
> > > Work is still in progress to integrate cgroup support with
> > > struct_ops (see also
> > > https://lore.kernel.org/linux-mm/87cy439x8a.fsf@linux.dev/).
> > > We should wait for this infrastructure to land before introducing
> > > new cgroup-based struct_ops.
> >
> > This work can be carried out in parallel with integrating cgroup
> > support with struct_ops. I will focus on addressing real workload
> > scenarios to further clear/stabilize the BPF-THP ABI.
> >
> > > > faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> > > > cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> > > > implementation.
> > > > - The test cases are too simplistic, lacking eBPF cases similar to real
> > > > workloads such as sched_ext.
> > > >
> > > > If I miss some thing, please let me know. Thanks!
> > >
> > > BTW, it would be better to include the original authors in the CC
> > > list, especially since their work is cited in your commit message. ;)
> >
> > OK, I will CC the relevant authors in the next version.
>
> Hello Vernon,
>
> I believe it would be best to hold off until David provides guidance
> on the future direction. While I am not currently active on BPF-THP,
> we are still looking for the right opportunity to upstream it. The
> primary difference is that your implementation is cgroup-based;
> however, we also plan to switch to that approach once Roman’s work
> lands.
LSF/MM/BPF has just concluded, so I think David will soon share some new
directions with us.
If it is merely about converting the per-process version to a cgroup
implementation (I have already completed development based on
cgroup-bpf), there is no need for you to start from scratch. You can
directly test whether mthp_ext addresses your real workload scenarios.
Based on the current patchset, we can collaborate on development, as
our shared goal is to get this done right.
--
Cheers,
Vernon