This patch introduces a new BPF struct_ops, bpf_thp_ops, for dynamic
THP tuning. It provides a hook, bpf_hook_thp_get_orders(), that allows
BPF programs to influence THP order selection based on factors such as:
- Workload identity
For example, workloads running in specific containers or cgroups.
- Allocation context
Whether the allocation occurs in the page fault path, khugepaged,
swap, or another path.
- VMA's memory advice settings
MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
PSI system data or associated cgroup PSI metrics
The kernel API of this new BPF hook is as follows:
/**
* thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
* @vma: vm_area_struct associated with the THP allocation
* @type: TVA type for current @vma
* @orders: Bitmask of available THP orders for this allocation
*
* Return: The suggested THP order for allocation from the BPF program. Must be
* a valid, available order.
*/
typedef int thp_order_fn_t(struct vm_area_struct *vma,
enum tva_type type,
unsigned long orders);
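For illustration, a minimal policy implementing this hook could look as
follows. This sketch is not part of the patch; it assumes a vmlinux.h
carrying the tva_type and bpf_thp_ops definitions from this series, and
it defines VM_HUGEPAGE locally because macros are not emitted into BTF:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Not present in vmlinux.h; value taken from include/linux/mm.h. */
#define VM_HUGEPAGE	0x20000000UL

char _license[] SEC("license") = "GPL";

SEC("struct_ops/thp_get_order")
int BPF_PROG(suggest_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	/* Suggest PMD-sized THP (order 9 on x86-64) when it is available
	 * and the VMA was advised with MADV_HUGEPAGE. Returning 0 leaves
	 * no THP order enabled for this allocation, because the kernel
	 * masks the available orders with BIT(return value).
	 */
	if ((orders & (1UL << 9)) && (vma->vm_flags & VM_HUGEPAGE))
		return 9;
	return 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.thp_get_order = (void *)suggest_order,
};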
Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.
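Attaching and updating from userspace follows the usual struct_ops
pattern. A rough libbpf loader sketch (the thp_policy skeleton name is
hypothetical and matches the example program above):

#include <unistd.h>
#include <bpf/libbpf.h>
#include "thp_policy.skel.h"	/* hypothetical bpftool-generated skeleton */

int main(void)
{
	struct thp_policy *skel;
	struct bpf_link *link;

	skel = thp_policy__open_and_load();
	if (!skel)
		return 1;

	/* bpf_thp_reg() sets TRANSPARENT_HUGEPAGE_BPF_ATTACHED, so a
	 * second attach attempt fails with -EBUSY while this link lives.
	 */
	link = bpf_map__attach_struct_ops(skel->maps.thp_policy);
	if (!link) {
		thp_policy__destroy(skel);
		return 1;
	}

	/* The policy stays active while the link exists; a replacement
	 * policy can be swapped in with bpf_link__update_map(), which
	 * reaches bpf_thp_update() and synchronizes via RCU.
	 */
	pause();

	bpf_link__destroy(link);
	thp_policy__destroy(skel);
	return 0;
}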
This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.
This BPF hook enables the implementation of flexible THP allocation
policies at the system, per-cgroup, or per-task level.
This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note
that this capability is currently unstable and may undergo significant
changes—including potential removal—in future kernel versions.
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
---
MAINTAINERS | 1 +
include/linux/huge_mm.h | 23 +++++
mm/Kconfig | 11 +++
mm/Makefile | 1 +
mm/huge_memory_bpf.c | 204 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 240 insertions(+)
create mode 100644 mm/huge_memory_bpf.c
diff --git a/MAINTAINERS b/MAINTAINERS
index ca8e3d18eedd..7be34b2a64fd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16257,6 +16257,7 @@ F: include/linux/huge_mm.h
F: include/linux/khugepaged.h
F: include/trace/events/huge_memory.h
F: mm/huge_memory.c
+F: mm/huge_memory_bpf.c
F: mm/khugepaged.c
F: mm/mm_slot.h
F: tools/testing/selftests/mm/khugepaged.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a635dcbb2b99..02055cc93bfe 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */
};
struct kobject;
@@ -269,6 +270,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
enum tva_type type,
unsigned long orders);
+#ifdef CONFIG_BPF_THP
+
+unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders);
+
+#else
+
+static inline unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders)
+{
+ return orders;
+}
+
+#endif
+
/**
* thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
* @vma: the vm area to check
@@ -290,6 +308,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
{
vm_flags_t vm_flags = vma->vm_flags;
+ /* An attached BPF program may restrict which orders can be selected. */
+ orders &= bpf_hook_thp_get_orders(vma, type, orders);
+ if (!orders)
+ return 0;
+
/*
* Optimization to check if required orders are enabled early. Only
* forced collapse ignores sysfs configs.
diff --git a/mm/Kconfig b/mm/Kconfig
index bde9f842a4a8..ffbcc5febb10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -895,6 +895,17 @@ config NO_PAGE_MAPCOUNT
EXPERIMENTAL because the impact of some changes is still unclear.
+config BPF_THP
+ bool "BPF-based THP Policy (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+ help
+ Enable dynamic THP policy adjustment using BPF programs. This feature
+ is currently experimental.
+
+ WARNING: This feature is unstable and may change in future kernel
+ versions.
+
endif # TRANSPARENT_HUGEPAGE
# simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..4efca1c8a919 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_BPF_THP) += huge_memory_bpf.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..47c124d588b2
--- /dev/null
+++ b/mm/huge_memory_bpf.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao <laoar.shao@gmail.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+/**
+ * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
+ * @vma: vm_area_struct associated with the THP allocation
+ * @type: TVA type for current @vma
+ * @orders: Bitmask of available THP orders for this allocation
+ *
+ * Return: The suggested THP order for allocation from the BPF program. Must be
+ * a valid, available order.
+ */
+typedef int thp_order_fn_t(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders);
+
+struct bpf_thp_ops {
+ thp_order_fn_t __rcu *thp_get_order;
+};
+
+static struct bpf_thp_ops bpf_thp;
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders)
+{
+ thp_order_fn_t *bpf_hook_thp_get_order;
+ int bpf_order;
+
+ /* No BPF program is attached */
+ if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags))
+ return orders;
+
+ rcu_read_lock();
+ bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
+ if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
+ goto out;
+
+ bpf_order = bpf_hook_thp_get_order(vma, type, orders);
+ orders &= BIT(bpf_order);
+
+out:
+ rcu_read_unlock();
+ return orders;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+ .get_func_proto = bpf_thp_get_func_proto,
+ .is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ /* The call site operates under RCU protection. */
+ if (prog->sleepable)
+ return -EINVAL;
+ return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ spin_lock(&thp_ops_lock);
+ if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags)) {
+ spin_unlock(&thp_ops_lock);
+ return -EBUSY;
+ }
+ WARN_ON_ONCE(rcu_access_pointer(bpf_thp.thp_get_order));
+ rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
+ spin_unlock(&thp_ops_lock);
+ return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+ thp_order_fn_t *old_fn;
+
+ spin_lock(&thp_ops_lock);
+ clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+ old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, NULL,
+ lockdep_is_held(&thp_ops_lock));
+ WARN_ON_ONCE(!old_fn);
+ spin_unlock(&thp_ops_lock);
+
+ synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ thp_order_fn_t *old_fn, *new_fn;
+ struct bpf_thp_ops *old = old_kdata;
+ struct bpf_thp_ops *ops = kdata;
+ int ret = 0;
+
+ if (!ops || !old)
+ return -EINVAL;
+
+ spin_lock(&thp_ops_lock);
+ /* The prog has already been removed. */
+ if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags)) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ new_fn = rcu_dereference(ops->thp_get_order);
+ old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, new_fn,
+ lockdep_is_held(&thp_ops_lock));
+ WARN_ON_ONCE(!old_fn || !new_fn);
+
+out:
+ spin_unlock(&thp_ops_lock);
+ if (!ret)
+ synchronize_rcu();
+ return ret;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ if (!ops->thp_get_order) {
+ pr_err("bpf_thp: required op isn't implemented\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int bpf_thp_get_order(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders)
+{
+ return -1;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+ .thp_get_order = (thp_order_fn_t __rcu *)bpf_thp_get_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+ .verifier_ops = &thp_bpf_verifier_ops,
+ .init = bpf_thp_init,
+ .check_member = bpf_thp_check_member,
+ .init_member = bpf_thp_init_member,
+ .reg = bpf_thp_reg,
+ .unreg = bpf_thp_unreg,
+ .update = bpf_thp_update,
+ .validate = bpf_thp_validate,
+ .cfi_stubs = &__bpf_thp_ops,
+ .owner = THIS_MODULE,
+ .name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+ int err;
+
+ err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+ if (err)
+ pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+ return err;
+}
+late_initcall(bpf_thp_ops_init);
--
2.47.3
On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> + enum tva_type type,
> + unsigned long orders)
> +{
> + thp_order_fn_t *bpf_hook_thp_get_order;
> + int bpf_order;
> +
> + /* No BPF program is attached */
> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> + &transparent_hugepage_flags))
> + return orders;
> +
> + rcu_read_lock();
> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> + goto out;
> +
> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> + orders &= BIT(bpf_order);
> +
> +out:
> + rcu_read_unlock();
> + return orders;
> +}
I thought I explained it earlier.
Nack to a single global prog approach.
The logic must accommodate multiple programs per-container
or any other way from the beginning.
If cgroup-based scoping doesn't fit, use per-process tree scoping.
On 03.10.25 04:18, Alexei Starovoitov wrote:
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>> [...]
>
> I thought I explained it earlier.
> Nack to a single global prog approach.
I agree. We should have the option to either specify a policy globally,
or more refined for cgroups/processes.
It's an interesting question if a program would ever want to ship its
own policy: I can see use cases for that.
So I agree that we should make it more flexible right from the start.
--
Cheers
David / dhildenb
On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.10.25 04:18, Alexei Starovoitov wrote:
> > On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >> [...]
> >
> > I thought I explained it earlier.
> > Nack to a single global prog approach.
>
> I agree. We should have the option to either specify a policy globally,
> or more refined for cgroups/processes.
>
> It's an interesting question if a program would ever want to ship its
> own policy: I can see use cases for that.
>
> So I agree that we should make it more flexible right from the start.
To achieve per-process granularity, the struct-ops must be embedded
within the mm_struct as follows:
+#ifdef CONFIG_BPF_MM
+struct bpf_mm_ops {
+#ifdef CONFIG_BPF_THP
+ struct bpf_thp_ops bpf_thp;
+#endif
+};
+#endif
+
/*
* Opaque type representing current mm_struct flag state. Must be accessed via
* mm_flags_xxx() helper functions.
@@ -1268,6 +1281,10 @@ struct mm_struct {
#ifdef CONFIG_MM_ID
mm_id_t mm_id;
#endif /* CONFIG_MM_ID */
+
+#ifdef CONFIG_BPF_MM
+ struct bpf_mm_ops bpf_mm;
+#endif
} __randomize_layout;
We should be aware that this will involve extensive changes in mm/. If
we're aligned on this direction, I'll start working on the patches.
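With that in place, the hook could consult the per-mm ops instead of the
global instance; a rough sketch, assuming the bpf_mm field above:

unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
				      enum tva_type type,
				      unsigned long orders)
{
	struct mm_struct *mm = vma->vm_mm;
	thp_order_fn_t *fn;
	int bpf_order;

	rcu_read_lock();
	/* Per-mm ops instead of the global bpf_thp instance. */
	fn = rcu_dereference(mm->bpf_mm.bpf_thp.thp_get_order);
	if (fn) {
		bpf_order = fn(vma, type, orders);
		orders &= BIT(bpf_order);
	}
	rcu_read_unlock();
	return orders;
}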
--
Regards
Yafang
On 08.10.25 10:18, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>> [...]
>>>
>>> I thought I explained it earlier.
>>> Nack to a single global prog approach.
>>
>> I agree. We should have the option to either specify a policy globally,
>> or more refined for cgroups/processes.
>>
>> It's an interesting question if a program would ever want to ship its
>> own policy: I can see use cases for that.
>>
>> So I agree that we should make it more flexible right from the start.
>
> To achieve per-process granularity, the struct-ops must be embedded
> within the mm_struct as follows:
>
> +#ifdef CONFIG_BPF_MM
> +struct bpf_mm_ops {
> +#ifdef CONFIG_BPF_THP
> + struct bpf_thp_ops bpf_thp;
> +#endif
> +};
> +#endif
> +
> /*
> * Opaque type representing current mm_struct flag state. Must be accessed via
> * mm_flags_xxx() helper functions.
> @@ -1268,6 +1281,10 @@ struct mm_struct {
> #ifdef CONFIG_MM_ID
> mm_id_t mm_id;
> #endif /* CONFIG_MM_ID */
> +
> +#ifdef CONFIG_BPF_MM
> + struct bpf_mm_ops bpf_mm;
> +#endif
> } __randomize_layout;
>
> We should be aware that this will involve extensive changes in mm/.
That's what we do on linux-mm :)
It would be great to use Alexei's feedback/experience to come up with
something that is flexible for various use cases.
So I think this is likely the right direction.
It would be great to evaluate which scenarios we could unlock with this
(global vs. per-process vs. per-cgroup) approach, and how
extensive/involved the changes will be.
If we need a slot in the bi-weekly mm alignment session to brainstorm,
we can ask Dave R. for one in the upcoming weeks.
--
Cheers
David / dhildenb
On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 10:18, Yafang Shao wrote:
> > On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >> [...]
> >
> > We should be aware that this will involve extensive changes in mm/.
>
> That's what we do on linux-mm :)
>
> It would be great to use Alexei's feedback/experience to come up with
> something that is flexible for various use cases.
I'm still not entirely convinced that allowing individual processes or
cgroups to run independent progs is a valid use case. However, since
we have a consensus that this is the right direction, I will proceed
with this approach.
>
> So I think this is likely the right direction.
>
> It would be great to evaluate which scenarios we could unlock with this
> (global vs. per-process vs. per-cgroup) approach, and how
> extensive/involved the changes will be.
1. Global Approach
- Pros:
Simple;
Can manage different THP policies for different cgroups or processes.
- Cons:
Does not allow individual processes to run their own BPF programs.
2. Per-Process Approach
- Pros:
Enables each process to run its own BPF program.
- Cons:
Introduces significant complexity, as it requires handling the
BPF program's lifecycle (creation, destruction, inheritance) within
every mm_struct.
3. Per-Cgroup Approach
- Pros:
Allows individual cgroups to run their own BPF programs.
Less complex than the per-process model, as it can leverage the
existing cgroup operations structure.
- Cons:
Creates a dependency on the cgroup subsystem.
Might not be easy to control at the per-process level.
>
> If we need a slot in the bi-weekly mm alignment session to brainstorm,
> we can ask Dave R. for one in the upcoming weeks.
I will draft an RFC to outline the required changes in both the mm/
and bpf/ subsystems and solicit feedback.
--
Regards
Yafang
On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>> [...]
>
> 1. Global Approach
> - Pros:
> Simple;
> Can manage different THP policies for different cgroups or processes.
> - Cons:
> Does not allow individual processes to run their own BPF programs.
>
> 2. Per-Process Approach
> - Pros:
> Enables each process to run its own BPF program.
> - Cons:
> Introduces significant complexity, as it requires handling the
> BPF program's lifecycle (creation, destruction, inheritance) within
> every mm_struct.
>
> 3. Per-Cgroup Approach
> - Pros:
> Allows individual cgroups to run their own BPF programs.
> Less complex than the per-process model, as it can leverage the
> existing cgroup operations structure.
> - Cons:
> Creates a dependency on the cgroup subsystem.
> Might not be easy to control at the per-process level.
Another issue is how, and by whom, hierarchical cgroups should be handled,
where one cgroup is a parent of another. Should the BPF program do that, or
the mm code? I remember hierarchical cgroups are the main reason THP control
at the cgroup level was rejected. If we do per-cgroup BPF control, wouldn't
we get the same rejection from the cgroup folks?
--
Best Regards,
Yan, Zi
On 08.10.25 13:27, Zi Yan wrote:
> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>
>> [...]
>
> Another issue is how, and by whom, hierarchical cgroups should be handled,
> where one cgroup is a parent of another. Should the BPF program do that, or
> the mm code? I remember hierarchical cgroups are the main reason THP control
> at the cgroup level was rejected. If we do per-cgroup BPF control, wouldn't
> we get the same rejection from the cgroup folks?
Valid point.
I do wonder if that problem was already encountered elsewhere with bpf
and if there is already a solution.
Focusing on processes instead of cgroups might be easier initially.
--
Cheers
David / dhildenb
On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 13:27, Zi Yan wrote:
> > [...]
>
> Valid point.
>
> I do wonder if that problem was already encountered elsewhere with bpf
> and if there is already a solution.
Our standard is to run only one instance of a BPF program type
system-wide to avoid conflicts. For example, we can't have both
systemd and a container runtime running bpf-thp simultaneously.
Perhaps Alexei can enlighten us, though we'd need to read between his
characteristically brief lines. ;-)
>
> Focusing on processes instead of cgroups might be easier initially.
--
Regards
Yafang
On 08.10.25 15:11, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>
> Our standard is to run only one instance of a BPF program type
> system-wide to avoid conflicts. For example, we can't have both
> systemd and a container runtime running bpf-thp simultaneously.
Right, it's a good question how to combine policies, or "who wins".
>
> Perhaps Alexei can enlighten us, though we'd need to read between his
> characteristically brief lines. ;-)
There might be some insights to be had in the bpf OOM discussion at
https://lkml.kernel.org/r/CAEf4BzafXv-PstSAP6krers=S74ri1+zTB4Y2oT6f+33yznqsA@mail.gmail.com
I didn't completely read through that, but that discussion also seems to
> be about interaction between cgroups and bpf programs.
--
Cheers
David / dhildenb
On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 15:11, Yafang Shao wrote:
> > [...]
> > Our standard is to run only one instance of a BPF program type
> > system-wide to avoid conflicts. For example, we can't have both
> > systemd and a container runtime running bpf-thp simultaneously.
>
> Right, it's a good question how to combine policies, or "who wins".
From my perspective, the ideal approach is to have one BPF-THP
instance per mm_struct. This allows for separate managers in different
domains, such as systemd managing BPF-THP for system processes and
containerd for container processes, while ensuring that any single
process is managed by only one BPF-THP.
>
> >
> > Perhaps Alexei can enlighten us, though we'd need to read between his
> > characteristically brief lines. ;-)
>
> There might be some insights to be had in the bpf OOM discussion at
>
> https://lkml.kernel.org/r/CAEf4BzafXv-PstSAP6krers=S74ri1+zTB4Y2oT6f+33yznqsA@mail.gmail.com
>
> I didn't completely read through that, but that discussion also seems to
> be about interaction between cgroups and bpd programs.
I have reviewed the discussions.
Given that the OOM might be cgroup-specific, implementing a
cgroup-based BPF-OOM handler makes sense.
--
Regards
Yafang
On 09.10.25 11:59, Yafang Shao wrote:
> On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>> Right, it's a good question how to combine policies, or "who wins".
>
> From my perspective, the ideal approach is to have one BPF-THP
> instance per mm_struct. This allows for separate managers in different
> domains, such as systemd managing BPF-THP for system processes and
> containerd for container processes, while ensuring that any single
> process is managed by only one BPF-THP.
I came to the same conclusion. At least it's a valid start.
Maybe we would later want a global fallback BPF-THP prog if none was
enabled for a specific MM.
But I would expect to start with a per MM way of doing it, it gives you
way more flexibility in the long run.
--
Cheers
David / dhildenb
On Fri, Oct 10, 2025 at 3:54 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 09.10.25 11:59, Yafang Shao wrote:
> > On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.10.25 15:11, Yafang Shao wrote:
> >>> On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 08.10.25 13:27, Zi Yan wrote:
> >>>>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> >>>>>
> >>>>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 08.10.25 10:18, Yafang Shao wrote:
> >>>>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>>>>>>> + enum tva_type type,
> >>>>>>>>>>> + unsigned long orders)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>>>>>>> + int bpf_order;
> >>>>>>>>>>> +
> >>>>>>>>>>> + /* No BPF program is attached */
> >>>>>>>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>>>>>>> + &transparent_hugepage_flags))
> >>>>>>>>>>> + return orders;
> >>>>>>>>>>> +
> >>>>>>>>>>> + rcu_read_lock();
> >>>>>>>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>>>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>>>>>>> + goto out;
> >>>>>>>>>>> +
> >>>>>>>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>>>>>>> + orders &= BIT(bpf_order);
> >>>>>>>>>>> +
> >>>>>>>>>>> +out:
> >>>>>>>>>>> + rcu_read_unlock();
> >>>>>>>>>>> + return orders;
> >>>>>>>>>>> +}
> >>>>>>>>>>
> >>>>>>>>>> I thought I explained it earlier.
> >>>>>>>>>> Nack to a single global prog approach.
> >>>>>>>>>
> >>>>>>>>> I agree. We should have the option to either specify a policy globally,
> >>>>>>>>> or more refined for cgroups/processes.
> >>>>>>>>>
> >>>>>>>>> It's an interesting question if a program would ever want to ship its
> >>>>>>>>> own policy: I can see use cases for that.
> >>>>>>>>>
> >>>>>>>>> So I agree that we should make it more flexible right from the start.
> >>>>>>>>
> >>>>>>>> To achieve per-process granularity, the struct-ops must be embedded
> >>>>>>>> within the mm_struct as follows:
> >>>>>>>>
> >>>>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>>>> +struct bpf_mm_ops {
> >>>>>>>> +#ifdef CONFIG_BPF_THP
> >>>>>>>> + struct bpf_thp_ops bpf_thp;
> >>>>>>>> +#endif
> >>>>>>>> +};
> >>>>>>>> +#endif
> >>>>>>>> +
> >>>>>>>> /*
> >>>>>>>> * Opaque type representing current mm_struct flag state. Must be accessed via
> >>>>>>>> * mm_flags_xxx() helper functions.
> >>>>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>>>>>>> #ifdef CONFIG_MM_ID
> >>>>>>>> mm_id_t mm_id;
> >>>>>>>> #endif /* CONFIG_MM_ID */
> >>>>>>>> +
> >>>>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>>>> + struct bpf_mm_ops bpf_mm;
> >>>>>>>> +#endif
> >>>>>>>> } __randomize_layout;
> >>>>>>>>
> >>>>>>>> We should be aware that this will involve extensive changes in mm/.
> >>>>>>>
> >>>>>>> That's what we do on linux-mm :)
> >>>>>>>
> >>>>>>> It would be great to use Alexei's feedback/experience to come up with
> >>>>>>> something that is flexible for various use cases.
> >>>>>>
> >>>>>> I'm still not entirely convinced that allowing individual processes or
> >>>>>> cgroups to run independent progs is a valid use case. However, since
> >>>>>> we have a consensus that this is the right direction, I will proceed
> >>>>>> with this approach.
> >>>>>>
> >>>>>>>
> >>>>>>> So I think this is likely the right direction.
> >>>>>>>
> >>>>>>> It would be great to evaluate which scenarios we could unlock with this
> >>>>>>> (global vs. per-process vs. per-cgroup) approach, and how
> >>>>>>> extensive/involved the changes will be.
> >>>>>>
> >>>>>> 1. Global Approach
> >>>>>> - Pros:
> >>>>>> Simple;
> >>>>>> Can manage different THP policies for different cgroups or processes.
> >>>>>> - Cons:
> >>>>>> Does not allow individual processes to run their own BPF programs.
> >>>>>>
> >>>>>> 2. Per-Process Approach
> >>>>>> - Pros:
> >>>>>> Enables each process to run its own BPF program.
> >>>>>> - Cons:
> >>>>>> Introduces significant complexity, as it requires handling the
> >>>>>> BPF program's lifecycle (creation, destruction, inheritance) within
> >>>>>> every mm_struct.
> >>>>>>
> >>>>>> 3. Per-Cgroup Approach
> >>>>>> - Pros:
> >>>>>> Allows individual cgroups to run their own BPF programs.
> >>>>>> Less complex than the per-process model, as it can leverage the
> >>>>>> existing cgroup operations structure.
> >>>>>> - Cons:
> >>>>>> Creates a dependency on the cgroup subsystem.
> >>>>>> might not be easy to control at the per-process level.
> >>>>>
> >>>>> Another issue is how and who should deal with hierarchical cgroups,
> >>>>> where one cgroup is a parent of another. Should the bpf program do
> >>>>> that, or the mm code? I remember hierarchical cgroups are the main
> >>>>> reason THP control at the cgroup level was rejected. If we do
> >>>>> per-cgroup bpf control, wouldn't we get the same rejection from
> >>>>> cgroup folks?
> >>>>
> >>>> Valid point.
> >>>>
> >>>> I do wonder if that problem was already encountered elsewhere with bpf
> >>>> and if there is already a solution.
> >>>
> >>> Our standard is to run only one instance of a BPF program type
> >>> system-wide to avoid conflicts. For example, we can't have both
> >>> systemd and a container runtime running bpf-thp simultaneously.
> >>
> >> Right, it's a good question how to combine policies, or "who wins".
> >
> > From my perspective, the ideal approach is to have one BPF-THP
> > instance per mm_struct. This allows for separate managers in different
> > domains, such as systemd managing BPF-THP for system processes and
> > containerd for container processes, while ensuring that any single
> > process is managed by only one BPF-THP.
>
> I came to the same conclusion. At least it's a valid start.
>
> Maybe we would later want a global fallback BPF-THP prog if none was
> enabled for a specific MM.
Good idea. We can fall back to the global model when attaching to pid 1.
>
> But I would expect to start with a per MM way of doing it, it gives you
> way more flexibility in the long run.
Some THP types, such as shmem and file-backed THP, are shareable across
multiple processes and cgroups. If we allow different BPF-THP policies to be
applied to these shared resources, it could lead to policy
inconsistencies. This would ultimately recreate a long-standing issue
in memcg, which still lacks a robust solution for this problem [0].
This suggests that applying SCOPED policies to SHAREABLE memory may be
fundamentally flawed ;-)
[0]. https://lore.kernel.org/linux-mm/YwNold0GMOappUxc@slm.duckdns.org/
(Added the maintainers from the old discussion to this thread.)
--
Regards
Yafang
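
[The per-MM program with a global fallback that David and Yafang converge
on above could look roughly like the following. This is a hedged sketch,
not merged code: the mm->bpf_mm embedding is taken from the struct snippet
quoted earlier in this thread, and the global bpf_thp from the patch
itself.]

/*
 * Sketch only: resolve the hook with per-MM priority and a global
 * fallback, per the discussion above. Assumes the mm->bpf_mm embedding
 * shown earlier in this thread and the global bpf_thp from the patch.
 * Caller is expected to hold rcu_read_lock().
 */
static thp_order_fn_t *bpf_thp_resolve(struct mm_struct *mm)
{
	thp_order_fn_t *fn = NULL;

#ifdef CONFIG_BPF_MM
	/* A program attached to this specific mm_struct wins. */
	fn = rcu_dereference(mm->bpf_mm.bpf_thp.thp_get_order);
#endif
	/* Fall back to the global program, e.g. the one attached to pid 1. */
	if (!fn)
		fn = rcu_dereference(bpf_thp.thp_get_order);
	return fn;
}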
>> I came to the same conclusion. At least it's a valid start.
>>
>> Maybe we would later want a global fallback BPF-THP prog if none was
>> enabled for a specific MM.
>
> Good idea. We can fall back to the global model when attaching to pid 1.
>
>> But I would expect to start with a per MM way of doing it, it gives you
>> way more flexibility in the long run.
>
> Some THP types, such as shmem and file-backed THP, are shareable across
> multiple processes and cgroups. If we allow different BPF-THP policies
> to be applied to these shared resources, it could lead to policy
> inconsistencies.

Sure, but nothing new about that (e.g., VM_HUGEPAGE, VM_NOHUGEPAGE,
PR_GET_THP_DISABLE).

I'd expect that we focus on anon THP as the first step either way.

Skimming over this series, anon memory seems to be the main focus.

> This would ultimately recreate a long-standing issue
> in memcg, which still lacks a robust solution for this problem [0].
>
> This suggests that applying SCOPED policies to SHAREABLE memory may be
> fundamentally flawed ;-)

Yeah, shared memory is usually more tricky: see mempolicy handling for
shmem. There, the policy is glued to a file rather than to a process.

--
Cheers

David / dhildenb
On Mon, Oct 13, 2025 at 8:42 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I came to the same conclusion. At least it's a valid start.
> >>
> >> Maybe we would later want a global fallback BPF-THP prog if none was
> >> enabled for a specific MM.
> >
> > Good idea. We can fall back to the global model when attaching to pid 1.
> >
> >> But I would expect to start with a per MM way of doing it, it gives you
> >> way more flexibility in the long run.
> >
> > Some THP types, such as shmem and file-backed THP, are shareable across
> > multiple processes and cgroups. If we allow different BPF-THP policies
> > to be applied to these shared resources, it could lead to policy
> > inconsistencies.
>
> Sure, but nothing new about that (e.g., VM_HUGEPAGE, VM_NOHUGEPAGE,
> PR_GET_THP_DISABLE).
>
> I'd expect that we focus on anon THP as the first step either way.
>
> Skimming over this series, anon memory seems to be the main focus.

Right, currently it is focusing on anon memory. In the next step it
will be extended to file-backed THP.

>
> > This would ultimately recreate a long-standing issue
> > in memcg, which still lacks a robust solution for this problem [0].
> >
> > This suggests that applying SCOPED policies to SHAREABLE memory may be
> > fundamentally flawed ;-)
>
> Yeah, shared memory is usually more tricky: see mempolicy handling for
> shmem. There, the policy is glued to a file rather than to a process.

For shared THP we are planning to apply the THP policy based on
vma->vm_file. Consequently, the existing BPF-THP policies, which are
scoped to a process or cgroup, are incompatible with shared THP. This
raises the question of how to effectively scope policies for shared
memory. While one option is to key the policy to the file structure,
this may not be ideal as it could lead to considerable implementation
and maintenance challenges...

--
Regards
Yafang
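
[To make the "key the policy to the file" idea concrete, here is a rough
BPF-side sketch. The struct_ops section name and hook signature follow
this series' proposal; the map layout, the direct vma->vm_file->f_inode
access, and the availability of struct bpf_thp_ops in vmlinux.h are
illustrative assumptions, not merged API.]

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u64);   /* inode number of the mapped file */
	__type(value, int); /* THP order chosen for that file */
} per_file_order SEC(".maps");

SEC("struct_ops/thp_get_order")
int BPF_PROG(thp_get_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	u64 ino;
	int *order;

	/* Anon memory: no file to key on; suggest order 0 here. */
	if (!vma->vm_file)
		return 0;

	/*
	 * Shared memory: every process mapping this file gets the same
	 * answer, sidestepping the per-task inconsistency noted above.
	 */
	ino = vma->vm_file->f_inode->i_ino;
	order = bpf_map_lookup_elem(&per_file_order, &ino);
	return order ? *order : 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_file_policy = {
	.thp_get_order = (void *)thp_get_order,
};

char LICENSE[] SEC("license") = "GPL";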
On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>
> > On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.10.25 10:18, Yafang Shao wrote:
> >>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>
> >>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>> + enum tva_type type,
> >>>>>> + unsigned long orders)
> >>>>>> +{
> >>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>> + int bpf_order;
> >>>>>> +
> >>>>>> + /* No BPF program is attached */
> >>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>> + &transparent_hugepage_flags))
> >>>>>> + return orders;
> >>>>>> +
> >>>>>> + rcu_read_lock();
> >>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>> + goto out;
> >>>>>> +
> >>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>> + orders &= BIT(bpf_order);
> >>>>>> +
> >>>>>> +out:
> >>>>>> + rcu_read_unlock();
> >>>>>> + return orders;
> >>>>>> +}
> >>>>>
> >>>>> I thought I explained it earlier.
> >>>>> Nack to a single global prog approach.
> >>>>
> >>>> I agree. We should have the option to either specify a policy globally,
> >>>> or more refined for cgroups/processes.
> >>>>
> >>>> It's an interesting question if a program would ever want to ship its
> >>>> own policy: I can see use cases for that.
> >>>>
> >>>> So I agree that we should make it more flexible right from the start.
> >>>
> >>> To achieve per-process granularity, the struct-ops must be embedded
> >>> within the mm_struct as follows:
> >>>
> >>> +#ifdef CONFIG_BPF_MM
> >>> +struct bpf_mm_ops {
> >>> +#ifdef CONFIG_BPF_THP
> >>> + struct bpf_thp_ops bpf_thp;
> >>> +#endif
> >>> +};
> >>> +#endif
> >>> +
> >>> /*
> >>> * Opaque type representing current mm_struct flag state. Must be accessed via
> >>> * mm_flags_xxx() helper functions.
> >>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>> #ifdef CONFIG_MM_ID
> >>> mm_id_t mm_id;
> >>> #endif /* CONFIG_MM_ID */
> >>> +
> >>> +#ifdef CONFIG_BPF_MM
> >>> + struct bpf_mm_ops bpf_mm;
> >>> +#endif
> >>> } __randomize_layout;
> >>>
> >>> We should be aware that this will involve extensive changes in mm/.
> >>
> >> That's what we do on linux-mm :)
> >>
> >> It would be great to use Alexei's feedback/experience to come up with
> >> something that is flexible for various use cases.
> >
> > I'm still not entirely convinced that allowing individual processes or
> > cgroups to run independent progs is a valid use case. However, since
> > we have a consensus that this is the right direction, I will proceed
> > with this approach.
> >
> >>
> >> So I think this is likely the right direction.
> >>
> >> It would be great to evaluate which scenarios we could unlock with this
> >> (global vs. per-process vs. per-cgroup) approach, and how
> >> extensive/involved the changes will be.
> >
> > 1. Global Approach
> > - Pros:
> > Simple;
> > Can manage different THP policies for different cgroups or processes.
> > - Cons:
> > Does not allow individual processes to run their own BPF programs.
> >
> > 2. Per-Process Approach
> > - Pros:
> > Enables each process to run its own BPF program.
> > - Cons:
> > Introduces significant complexity, as it requires handling the
> > BPF program's lifecycle (creation, destruction, inheritance) within
> > every mm_struct.
> >
> > 3. Per-Cgroup Approach
> > - Pros:
> > Allows individual cgroups to run their own BPF programs.
> > Less complex than the per-process model, as it can leverage the
> > existing cgroup operations structure.
> > - Cons:
> > Creates a dependency on the cgroup subsystem.
> > Might not be easy to control at the per-process level.
>
> Another issue is how and who should deal with hierarchical cgroups, where
> one cgroup is a parent of another. Should the bpf program do that, or the
> mm code?
The cgroup subsystem handles this propagation automatically. When a
BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
automatically inherited by all descendant cgroups.
Note: struct-ops programs aren't supported by cgroup_bpf_attach(),
requiring us to build new attachment mechanisms for cgroup-based
struct-ops.
> I remember hierarchical cgroups are the main reason THP control
> at the cgroup level was rejected. If we do per-cgroup bpf control,
> wouldn't we get the same rejection from cgroup folks?
Right, it was rejected by the cgroup maintainers [0]
[0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
--
Regards
Yafang
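
[For reference, the inheritance Yafang mentions looks like this from
userspace for an ordinary cgroup program type (struct_ops cannot be
attached this way, as noted above). The cgroup path and attach type are
placeholders for the example.]

/* Illustration only: attach an ordinary cgroup program to a parent cgroup. */
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

int attach_to_parent_cgroup(int prog_fd)
{
	int cg_fd = open("/sys/fs/cgroup/parent", O_RDONLY); /* assumed path */
	int err;

	if (cg_fd < 0)
		return -1;
	/*
	 * Once attached here, the program becomes part of the effective
	 * program array of every descendant cgroup as well.
	 */
	err = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_EGRESS, 0);
	close(cg_fd);
	return err;
}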
Hi,
On 10/8/2025 3:06 PM, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>>
>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>>>
>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>>> + enum tva_type type,
>>>>>>>> + unsigned long orders)
>>>>>>>> +{
>>>>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>>> + int bpf_order;
>>>>>>>> +
>>>>>>>> + /* No BPF program is attached */
>>>>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>>> + &transparent_hugepage_flags))
>>>>>>>> + return orders;
>>>>>>>> +
>>>>>>>> + rcu_read_lock();
>>>>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>>> + goto out;
>>>>>>>> +
>>>>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>>> + orders &= BIT(bpf_order);
>>>>>>>> +
>>>>>>>> +out:
>>>>>>>> + rcu_read_unlock();
>>>>>>>> + return orders;
>>>>>>>> +}
>>>>>>>
>>>>>>> I thought I explained it earlier.
>>>>>>> Nack to a single global prog approach.
>>>>>>
>>>>>> I agree. We should have the option to either specify a policy globally,
>>>>>> or more refined for cgroups/processes.
>>>>>>
>>>>>> It's an interesting question if a program would ever want to ship its
>>>>>> own policy: I can see use cases for that.
>>>>>>
>>>>>> So I agree that we should make it more flexible right from the start.
>>>>>
>>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>>> within the mm_struct as follows:
>>>>>
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> +struct bpf_mm_ops {
>>>>> +#ifdef CONFIG_BPF_THP
>>>>> + struct bpf_thp_ops bpf_thp;
>>>>> +#endif
>>>>> +};
>>>>> +#endif
>>>>> +
>>>>> /*
>>>>> * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>> * mm_flags_xxx() helper functions.
>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>> #ifdef CONFIG_MM_ID
>>>>> mm_id_t mm_id;
>>>>> #endif /* CONFIG_MM_ID */
>>>>> +
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> + struct bpf_mm_ops bpf_mm;
>>>>> +#endif
>>>>> } __randomize_layout;
>>>>>
>>>>> We should be aware that this will involve extensive changes in mm/.
>>>>
>>>> That's what we do on linux-mm :)
>>>>
>>>> It would be great to use Alexei's feedback/experience to come up with
>>>> something that is flexible for various use cases.
>>>
>>> I'm still not entirely convinced that allowing individual processes or
>>> cgroups to run independent progs is a valid use case. However, since
>>> we have a consensus that this is the right direction, I will proceed
>>> with this approach.
>>>
>>>>
>>>> So I think this is likely the right direction.
>>>>
>>>> It would be great to evaluate which scenarios we could unlock with this
>>>> (global vs. per-process vs. per-cgroup) approach, and how
>>>> extensive/involved the changes will be.
>>>
>>> 1. Global Approach
>>> - Pros:
>>> Simple;
>>> Can manage different THP policies for different cgroups or processes.
>>> - Cons:
>>> Does not allow individual processes to run their own BPF programs.
>>>
>>> 2. Per-Process Approach
>>> - Pros:
>>> Enables each process to run its own BPF program.
>>> - Cons:
>>> Introduces significant complexity, as it requires handling the
>>> BPF program's lifecycle (creation, destruction, inheritance) within
>>> every mm_struct.
>>>
>>> 3. Per-Cgroup Approach
>>> - Pros:
>>> Allows individual cgroups to run their own BPF programs.
>>> Less complex than the per-process model, as it can leverage the
>>> existing cgroup operations structure.
>>> - Cons:
>>> Creates a dependency on the cgroup subsystem.
>>> Might not be easy to control at the per-process level.
>>
>> Another issue is how and who should deal with hierarchical cgroups, where
>> one cgroup is a parent of another. Should the bpf program do that, or the
>> mm code?
>
> The cgroup subsystem handles this propagation automatically. When a
> BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
> automatically inherited by all descendant cgroups.
>
> Note: struct-ops programs aren't supported by cgroup_bpf_attach(),
> requiring us to build new attachment mechanisms for cgroup-based
> struct-ops.
>
>> I remember hierarchical cgroups are the main reason THP control
>> at the cgroup level was rejected. If we do per-cgroup bpf control,
>> wouldn't we get the same rejection from cgroup folks?
>
> Right, it was rejected by the cgroup maintainers [0]
>
> [0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
>
Yes, the patch was rejected because:
1. It breaks the cgroup hierarchy when two siblings have different THP
policies.
2. Cgroups were designed for resource management, not for grouping
processes and tuning them.
3. It sets a precedent for other people adding new flags to cgroups,
potentially polluting them. We may end up with cgroups having tens of
different flags, making the sysadmin's job more complex.
In the MM call I proposed a new mechanism based on limits, something like
hugetlbfs.
The main issue, still, is that sysadmins need to set those up, making
their lives more complex.
I remember a few participants mentioned the idea of the kernel setting huge
page consumption automatically using some sort of heuristics. To be honest,
I haven't had the time to sit down and think about it. I would be glad to
cooperate and come up with a feasible solution.
--
Asier Gutierrez
Huawei
On Fri, Oct 3, 2025 at 10:18 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > + enum tva_type type,
> > + unsigned long orders)
> > +{
> > + thp_order_fn_t *bpf_hook_thp_get_order;
> > + int bpf_order;
> > +
> > + /* No BPF program is attached */
> > + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > + &transparent_hugepage_flags))
> > + return orders;
> > +
> > + rcu_read_lock();
> > + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> > + goto out;
> > +
> > + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> > + orders &= BIT(bpf_order);
> > +
> > +out:
> > + rcu_read_unlock();
> > + return orders;
> > +}
>
Hello Alexei,
My apologies for the slow reply. I'm on a family vacation and am
checking email intermittently.
> I thought I explained it earlier.
I recall your earlier suggestion for a cgroup-based approach for
BPF-THP. However, as I mentioned, I believe cgroups might not be the
best fit[0]. My understanding was that we had agreed to move away from
that model. Could we realign on this?
[0]. https://lwn.net/ml/all/CALOAHbBvwT+6f_4gBHzPc9n_SukhAs_sa5yX=AjHYsWic1MRuw@mail.gmail.com/
> Nack to a single global prog approach.
The design of BPF-THP as a global program is a direct consequence of
its purpose: to extend the existing global
/sys/kernel/mm/transparent_hugepage/ interface. This architectural
consistency simplifies both understanding and maintenance.
Crucially, this global nature does not limit policy control. The
program is designed with the flexibility to enforce policies at
multiple levels—globally, per-cgroup, or per-task—enabling all of our
target use cases through a unified mechanism.
>
> The logic must accommodate multiple programs per-container
> or any other way from the beginning.
> If cgroup based scoping doesn't fit, use per process tree scoping.
During the initial design of BPF-THP, I evaluated whether a global
program or a per-process program would be more suitable. While a
per-process design would require embedding a struct_ops into
task_struct, this seemed like over-engineering to me. We can
efficiently implement both cgroup-tree-scoped and process-tree-scoped
THP policies using existing BPF helpers, such as:
  SCOPING            BPF kfuncs
  cgroup tree    ->  bpf_task_under_cgroup()
  process tree   ->  bpf_task_is_ancestors()
With these kfuncs, there is no need to attach individual BPF-THP
programs to every process or cgroup tree. I have not identified a
valid use case that necessitates embedding a struct_ops in task_struct
which can't be achieved more simply with these kfuncs. If such use
cases exist, please detail them. Consequently, I proceeded with a
global struct_ops implementation.
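
[As a concrete illustration of the cgroup-tree scoping described above,
here is a hedged sketch of a single global program. The kfunc signatures
are the mainline ones; the struct_ops section name, the hook signature,
and the order-9 constant for PMD-sized pages on x86-64 with 4K pages are
assumptions of this series, not merged API.]

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

u64 target_cgid; /* cgroup ID, set by the loader before attaching */

struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
long bpf_task_under_cgroup(struct task_struct *task,
			   struct cgroup *ancestor) __ksym;
void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

SEC("struct_ops/thp_get_order")
int BPF_PROG(thp_get_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	struct cgroup *cgrp = bpf_cgroup_from_id(target_cgid);
	int order = 0; /* default: suggest order 0 outside the tree */

	if (!cgrp)
		return 0;
	/*
	 * Suggest PMD-sized THP only for tasks under the target cgroup
	 * tree, and only if that order is actually available.
	 */
	if (bpf_task_under_cgroup(bpf_get_current_task_btf(), cgrp) &&
	    (orders & (1UL << 9)))
		order = 9;
	bpf_cgroup_release(cgrp);
	return order;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_cgroup_policy = {
	.thp_get_order = (void *)thp_get_order,
};

char LICENSE[] SEC("license") = "GPL";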
The desire to attach multiple BPF-THP programs simultaneously does not
appear to be a valid use case. Furthermore, our production experience
has shown that multiple attachments often introduce conflicts. This is
precisely why system administrators prefer to manage BPF programs with
a single manager—to avoid undefined behaviors from competing programs.
Focusing specifically on BPF-THP, the semantics of the program make
multiple attachments unsuitable. A BPF-THP program's outcome is its
return value (a suggested THP order), not the side effects of its
execution. In other words, it is functionally a variant of fmod_ret.
If we allow multiple attachments and they return different values, how
do we resolve the conflict?
If one program returns order-9 and another returns order-1, which
value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
appropriate. The only logical solution is to reject subsequent
attachments and explicitly notify the user of the conflict. Our goal
should be to prevent conflicts from the outset, rather than forcing
developers to create another userspace manager to handle them.
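
[A sketch of that reject-and-notify behaviour in the struct_ops
registration path. The flag is the one from the patch quoted earlier in
this thread; the rest of the body is assumed for illustration.]

/* Sketch: first attacher wins; a second attach fails loudly with -EBUSY. */
static int bpf_thp_reg(void *kdata, struct bpf_link *link)
{
	struct bpf_thp_ops *ops = kdata;

	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
			     &transparent_hugepage_flags))
		return -EBUSY;

	rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
	return 0;
}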
A single global program is a natural and logical extension of the
existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
a good fit for BPF-THP and avoids unnecessary complexity.
Please provide a detailed clarification if I have misunderstood your position.
--
Regards
Yafang
On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> has shown that multiple attachments often introduce conflicts. This is
> precisely why system administrators prefer to manage BPF programs with
> a single manager—to avoid undefined behaviors from competing programs.

I don't believe a single bit of this. bpf-thp didn't have any production
exposure. Everything that you said above is wishful thinking.

In actual production every programmable component needs to be scoped in
some way. One can argue that scheduling is a global property too, yet
sched-ext only works on a specific scheduling class.

All bpf program types are scoped except tracing, since kprobe/fentry are
global by definition, and even then multiple tracing programs can be
attached to the same kprobe.

> execution. In other words, it is functionally a variant of fmod_ret.

hid-bpf initially went with the fmod_ret approach, deleted the whole thing
and redesigned it with _scoped_ struct-ops.

> If we allow multiple attachments and they return different values, how
> do we resolve the conflict?
>
> If one program returns order-9 and another returns order-1, which
> value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
> appropriate.

No. If you cannot figure out how to stack multiple programs it means that
the API you picked is broken.

> A single global program is a natural and logical extension of the
> existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
> a good fit for BPF-THP and avoids unnecessary complexity.

The Nack to single global prog is not negotiable.
On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > has shown that multiple attachments often introduce conflicts. This is
> > precisely why system administrators prefer to manage BPF programs with
> > a single manager—to avoid undefined behaviors from competing programs.
>
> I don't believe a single bit of this.

You should spend some time seeing how users are actually applying BPF
in practice. Some information for you:

https://github.com/bpfman/bpfman
https://github.com/DataDog/ebpf-manager
https://github.com/ccfos/huatuo

> bpf-thp didn't have any production exposure.
> Everything that you said above is wishful thinking.

The statement above applies to other multi-attachable programs, not to
bpf-thp.

> In actual production every programmable component needs to be scoped in
> some way. One can argue that scheduling is a global property too, yet
> sched-ext only works on a specific scheduling class.

I can also argue that bpf-thp only works on a specific thp mode
(madvise and always) ;-)

> All bpf program types are scoped except tracing, since kprobe/fentry are
> global by definition, and even then multiple tracing programs can be
> attached to the same kprobe.
>
> > execution. In other words, it is functionally a variant of fmod_ret.
>
> hid-bpf initially went with the fmod_ret approach, deleted the whole thing
> and redesigned it with _scoped_ struct-ops.

I see little value in embedding a bpf_thp_struct_ops into the
task_struct. The benefits don't appear to justify the added
complexity.

> > If we allow multiple attachments and they return different values, how
> > do we resolve the conflict?
> >
> > If one program returns order-9 and another returns order-1, which
> > value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
> > appropriate.
>
> No. If you cannot figure out how to stack multiple programs it means that
> the API you picked is broken.
>
> > A single global program is a natural and logical extension of the
> > existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
> > a good fit for BPF-THP and avoids unnecessary complexity.
>
> The Nack to single global prog is not negotiable.

We still lack a compelling technical reason for embedding
bpf_thp_struct_ops into task_struct. Can you clearly articulate the
problem that this specific design is solving?

--
Regards
Yafang
On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > has shown that multiple attachments often introduce conflicts. This is
> > > precisely why system administrators prefer to manage BPF programs with
> > > a single manager—to avoid undefined behaviors from competing programs.
> >
> > I don't believe a single bit of this.
>
> You should spend some time seeing how users are actually applying BPF
> in practice. Some information for you:
>
> https://github.com/bpfman/bpfman
> https://github.com/DataDog/ebpf-manager
> https://github.com/ccfos/huatuo

By seeing the above you learned the wrong lesson.
These orchestrators and many others were created because
we made mistakes in the kernel by not scoping the progs enough.
XDP is a prime example. It allows one program per netdev.
This was a massive mistake which we're still trying to fix.

> > hid-bpf initially went with the fmod_ret approach, deleted the whole thing
> > and redesigned it with _scoped_ struct-ops.
>
> I see little value in embedding a bpf_thp_struct_ops into the
> task_struct. The benefits don't appear to justify the added
> complexity.

huh? where did I say that struct-ops should be embedded in task_struct?
On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > has shown that multiple attachments often introduce conflicts. This is
> > > > precisely why system administrators prefer to manage BPF programs with
> > > > a single manager—to avoid undefined behaviors from competing programs.
> > >
> > > I don't believe a single bit of this.
> >
> > You should spend some time seeing how users are actually applying BPF
> > in practice. Some information for you:
> >
> > https://github.com/bpfman/bpfman
> > https://github.com/DataDog/ebpf-manager
> > https://github.com/ccfos/huatuo
>
> By seeing the above you learned the wrong lesson.
> These orchestrators and many others were created because
> we made mistakes in the kernel by not scoping the progs enough.
> XDP is a prime example. It allows one program per netdev.
> This was a massive mistake which we're still trying to fix.

Since we don't use XDP in production, I can't comment on it. However,
for our multi-attachable cgroup BPF programs, a key issue arises: if a
program has permission to attach to one cgroup, it can attach to any
cgroup. While scoping enables attachment to individual cgroups, it
does not enforce isolation. This means we must still check for
conflicts between programs, which raises the question: what is the
functional purpose of this scoping mechanism?

> > hid-bpf initially went with the fmod_ret approach, deleted the whole thing
> > and redesigned it with _scoped_ struct-ops.
> >
> > I see little value in embedding a bpf_thp_struct_ops into the
> > task_struct. The benefits don't appear to justify the added
> > complexity.
>
> huh? where did I say that struct-ops should be embedded in task_struct?

Given that, what would you propose? My position is that the only valid
scope for bpf-thp is at the level of specific THP modes like madvise
and always. This patch correctly implements that precise design.

--
Regards
Yafang
On Tue, Oct 7, 2025 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > has shown that multiple attachments often introduce conflicts. This is
> > > > > precisely why system administrators prefer to manage BPF programs with
> > > > > a single manager—to avoid undefined behaviors from competing programs.
> > > >
> > > > I don't believe a single bit of this.
> > >
> > > You should spend some time seeing how users are actually applying BPF
> > > in practice. Some information for you:
> > >
> > > https://github.com/bpfman/bpfman
> > > https://github.com/DataDog/ebpf-manager
> > > https://github.com/ccfos/huatuo
> >
> > By seeing the above you learned the wrong lesson.
> > These orchestrators and many others were created because
> > we made mistakes in the kernel by not scoping the progs enough.
> > XDP is a prime example. It allows one program per netdev.
> > This was a massive mistake which we're still trying to fix.
>
> Since we don't use XDP in production, I can't comment on it. However,
> for our multi-attachable cgroup BPF programs, a key issue arises: if a
> program has permission to attach to one cgroup, it can attach to any
> cgroup. While scoping enables attachment to individual cgroups, it
> does not enforce isolation. This means we must still check for
> conflicts between programs, which raises the question: what is the
> functional purpose of this scoping mechanism?

cgroup mprog was added to remove the need for an orchestrator.

> My position is that the only valid scope for bpf-thp is at the level
> of specific THP modes like madvise and always. This patch correctly
> implements that precise design.

I'm done with this thread.

Nacked-by: Alexei Starovoitov <ast@kernel.org>
On Wed, Oct 8, 2025 at 12:39 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > has shown that multiple attachments often introduce conflicts. This is
> > > > > > precisely why system administrators prefer to manage BPF programs with
> > > > > > a single manager—to avoid undefined behaviors from competing programs.
> > > > >
> > > > > I don't believe a single bit of this.
> > > >
> > > > You should spend some time seeing how users are actually applying BPF
> > > > in practice. Some information for you:
> > > >
> > > > https://github.com/bpfman/bpfman
> > > > https://github.com/DataDog/ebpf-manager
> > > > https://github.com/ccfos/huatuo
> > >
> > > By seeing the above you learned the wrong lesson.
> > > These orchestrators and many others were created because
> > > we made mistakes in the kernel by not scoping the progs enough.
> > > XDP is a prime example. It allows one program per netdev.
> > > This was a massive mistake which we're still trying to fix.
> >
> > Since we don't use XDP in production, I can't comment on it. However,
> > for our multi-attachable cgroup BPF programs, a key issue arises: if a
> > program has permission to attach to one cgroup, it can attach to any
> > cgroup. While scoping enables attachment to individual cgroups, it
> > does not enforce isolation. This means we must still check for
> > conflicts between programs, which raises the question: what is the
> > functional purpose of this scoping mechanism?
>
> cgroup mprog was added to remove the need for an orchestrator.

However, this approach would still require a userspace manager to
coordinate the mprog attachments and prevent conflicts between
different programs, no?

> > My position is that the only valid scope for bpf-thp is at the level
> > of specific THP modes like madvise and always. This patch correctly
> > implements that precise design.
>
> I'm done with this thread.
>
> Nacked-by: Alexei Starovoitov <ast@kernel.org>

Given its experimental status, I believe any scoping mechanism would
be premature and over-engineered. Even integrating it into the
mm_struct introduces unnecessary complexity at this stage.

--
Regards
Yafang