This patch introduces a new BPF struct_ops, bpf_thp_ops, for dynamic
THP tuning. It provides a hook, bpf_hook_thp_get_orders(), that allows
BPF programs to influence THP order selection based on factors such as:
- Workload identity
For example, workloads running in specific containers or cgroups.
- Allocation context
Whether the allocation occurs in the page fault path, khugepaged,
swap, or another path.
- VMA's memory advice settings
MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
PSI system data or associated cgroup PSI metrics
The kernel API of this new BPF hook is as follows:
/**
* thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
* @vma: vm_area_struct associated with the THP allocation
* @type: TVA type for current @vma
* @orders: Bitmask of available THP orders for this allocation
*
* Return: The suggested THP order for allocation from the BPF program. Must be
* a valid, available order.
*/
typedef int thp_order_fn_t(struct vm_area_struct *vma,
enum tva_type type,
unsigned long orders);
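For illustration, a minimal policy implementing this hook could look as
follows. This sketch is not part of the patch; it assumes a vmlinux.h
carrying the tva_type and bpf_thp_ops definitions from this series, and
it defines VM_HUGEPAGE locally because macros are not emitted into BTF:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Not present in vmlinux.h; value taken from include/linux/mm.h. */
#define VM_HUGEPAGE	0x20000000UL

char _license[] SEC("license") = "GPL";

SEC("struct_ops/thp_get_order")
int BPF_PROG(suggest_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	/* Suggest PMD-sized THP (order 9 on x86-64) when it is available
	 * and the VMA was advised with MADV_HUGEPAGE. Returning 0 leaves
	 * no THP order enabled for this allocation, because the kernel
	 * masks the available orders with BIT(return value).
	 */
	if ((orders & (1UL << 9)) && (vma->vm_flags & VM_HUGEPAGE))
		return 9;
	return 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.thp_get_order = (void *)suggest_order,
};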
Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.
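Attaching and updating from userspace follows the usual struct_ops
pattern. A rough libbpf loader sketch (the thp_policy skeleton name is
hypothetical and matches the example program above):

#include <unistd.h>
#include <bpf/libbpf.h>
#include "thp_policy.skel.h"	/* hypothetical bpftool-generated skeleton */

int main(void)
{
	struct thp_policy *skel;
	struct bpf_link *link;

	skel = thp_policy__open_and_load();
	if (!skel)
		return 1;

	/* bpf_thp_reg() sets TRANSPARENT_HUGEPAGE_BPF_ATTACHED, so a
	 * second attach attempt fails with -EBUSY while this link lives.
	 */
	link = bpf_map__attach_struct_ops(skel->maps.thp_policy);
	if (!link) {
		thp_policy__destroy(skel);
		return 1;
	}

	/* The policy stays active while the link exists; a replacement
	 * policy can be swapped in with bpf_link__update_map(), which
	 * reaches bpf_thp_update() and synchronizes via RCU.
	 */
	pause();

	bpf_link__destroy(link);
	thp_policy__destroy(skel);
	return 0;
}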
This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.
This BPF hook enables the implementation of flexible THP allocation
policies at the system, per-cgroup, or per-task level.
This feature requires CONFIG_BPF_THP (EXPERIMENTAL) to be enabled. Note
that this capability is currently unstable and may undergo significant
changes—including potential removal—in future kernel versions.
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
---
MAINTAINERS | 1 +
include/linux/huge_mm.h | 23 +++++
mm/Kconfig | 11 +++
mm/Makefile | 1 +
mm/huge_memory_bpf.c | 204 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 240 insertions(+)
create mode 100644 mm/huge_memory_bpf.c
diff --git a/MAINTAINERS b/MAINTAINERS
index ca8e3d18eedd..7be34b2a64fd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16257,6 +16257,7 @@ F: include/linux/huge_mm.h
F: include/linux/khugepaged.h
F: include/trace/events/huge_memory.h
F: mm/huge_memory.c
+F: mm/huge_memory_bpf.c
F: mm/khugepaged.c
F: mm/mm_slot.h
F: tools/testing/selftests/mm/khugepaged.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a635dcbb2b99..02055cc93bfe 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */
};
struct kobject;
@@ -269,6 +270,23 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
enum tva_type type,
unsigned long orders);
+#ifdef CONFIG_BPF_THP
+
+unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders);
+
+#else
+
+static inline unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, enum tva_type type,
+ unsigned long orders)
+{
+ return orders;
+}
+
+#endif
+
/**
* thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
* @vma: the vm area to check
@@ -290,6 +308,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
{
vm_flags_t vm_flags = vma->vm_flags;
+ /* An attached BPF program may restrict which orders can be selected. */
+ orders &= bpf_hook_thp_get_orders(vma, type, orders);
+ if (!orders)
+ return 0;
+
/*
* Optimization to check if required orders are enabled early. Only
* forced collapse ignores sysfs configs.
diff --git a/mm/Kconfig b/mm/Kconfig
index bde9f842a4a8..ffbcc5febb10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -895,6 +895,17 @@ config NO_PAGE_MAPCOUNT
EXPERIMENTAL because the impact of some changes is still unclear.
+config BPF_THP
+ bool "BPF-based THP Policy (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+ help
+ Enable dynamic THP policy adjustment using BPF programs. This feature
+ is currently experimental.
+
+ WARNING: This feature is unstable and may change in future kernel
+ versions.
+
endif # TRANSPARENT_HUGEPAGE
# simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..4efca1c8a919 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_BPF_THP) += huge_memory_bpf.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..47c124d588b2
--- /dev/null
+++ b/mm/huge_memory_bpf.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao <laoar.shao@gmail.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+/**
+ * thp_order_fn_t: Get the suggested THP order from a BPF program for allocation
+ * @vma: vm_area_struct associated with the THP allocation
+ * @type: TVA type for current @vma
+ * @orders: Bitmask of available THP orders for this allocation
+ *
+ * Return: The suggested THP order for allocation from the BPF program. Must be
+ * a valid, available order.
+ */
+typedef int thp_order_fn_t(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders);
+
+struct bpf_thp_ops {
+ thp_order_fn_t __rcu *thp_get_order;
+};
+
+static struct bpf_thp_ops bpf_thp;
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders)
+{
+ thp_order_fn_t *bpf_hook_thp_get_order;
+ int bpf_order;
+
+ /* No BPF program is attached */
+ if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags))
+ return orders;
+
+ rcu_read_lock();
+ bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
+ if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
+ goto out;
+
+ bpf_order = bpf_hook_thp_get_order(vma, type, orders);
+ orders &= BIT(bpf_order);
+
+out:
+ rcu_read_unlock();
+ return orders;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+ .get_func_proto = bpf_thp_get_func_proto,
+ .is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ /* The call site operates under RCU protection. */
+ if (prog->sleepable)
+ return -EINVAL;
+ return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ spin_lock(&thp_ops_lock);
+ if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags)) {
+ spin_unlock(&thp_ops_lock);
+ return -EBUSY;
+ }
+ WARN_ON_ONCE(rcu_access_pointer(bpf_thp.thp_get_order));
+ rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
+ spin_unlock(&thp_ops_lock);
+ return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+ thp_order_fn_t *old_fn;
+
+ spin_lock(&thp_ops_lock);
+ clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+ old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, NULL,
+ lockdep_is_held(&thp_ops_lock));
+ WARN_ON_ONCE(!old_fn);
+ spin_unlock(&thp_ops_lock);
+
+ synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ thp_order_fn_t *old_fn, *new_fn;
+ struct bpf_thp_ops *old = old_kdata;
+ struct bpf_thp_ops *ops = kdata;
+ int ret = 0;
+
+ if (!ops || !old)
+ return -EINVAL;
+
+ spin_lock(&thp_ops_lock);
+ /* The prog has already been removed. */
+ if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags)) {
+ ret = -ENOENT;
+ goto out;
+ }
+
+ new_fn = rcu_dereference(ops->thp_get_order);
+ old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, new_fn,
+ lockdep_is_held(&thp_ops_lock));
+ WARN_ON_ONCE(!old_fn || !new_fn);
+
+out:
+ spin_unlock(&thp_ops_lock);
+ if (!ret)
+ synchronize_rcu();
+ return ret;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ if (!ops->thp_get_order) {
+ pr_err("bpf_thp: required op isn't implemented\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int bpf_thp_get_order(struct vm_area_struct *vma,
+ enum tva_type type,
+ unsigned long orders)
+{
+ return -1;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+ .thp_get_order = (thp_order_fn_t __rcu *)bpf_thp_get_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+ .verifier_ops = &thp_bpf_verifier_ops,
+ .init = bpf_thp_init,
+ .check_member = bpf_thp_check_member,
+ .init_member = bpf_thp_init_member,
+ .reg = bpf_thp_reg,
+ .unreg = bpf_thp_unreg,
+ .update = bpf_thp_update,
+ .validate = bpf_thp_validate,
+ .cfi_stubs = &__bpf_thp_ops,
+ .owner = THIS_MODULE,
+ .name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+ int err;
+
+ err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+ if (err)
+ pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+ return err;
+}
+late_initcall(bpf_thp_ops_init);
--
2.47.3
On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> + enum tva_type type,
> + unsigned long orders)
> +{
> + thp_order_fn_t *bpf_hook_thp_get_order;
> + int bpf_order;
> +
> + /* No BPF program is attached */
> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> + &transparent_hugepage_flags))
> + return orders;
> +
> + rcu_read_lock();
> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> + goto out;
> +
> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> + orders &= BIT(bpf_order);
> +
> +out:
> + rcu_read_unlock();
> + return orders;
> +}
I thought I explained it earlier.
Nack to a single global prog approach.
The logic must accommodate multiple programs per-container
or any other way from the beginning.
If cgroup-based scoping doesn't fit, use per-process tree scoping.
On 03.10.25 04:18, Alexei Starovoitov wrote:
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>> [...]
>
> I thought I explained it earlier.
> Nack to a single global prog approach.
I agree. We should have the option to either specify a policy globally,
or more refined for cgroups/processes.
It's an interesting question if a program would ever want to ship its
own policy: I can see use cases for that.
So I agree that we should make it more flexible right from the start.
--
Cheers
David / dhildenb
On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 03.10.25 04:18, Alexei Starovoitov wrote:
> > On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >> [...]
> >
> > I thought I explained it earlier.
> > Nack to a single global prog approach.
>
> I agree. We should have the option to either specify a policy globally,
> or more refined for cgroups/processes.
>
> It's an interesting question if a program would ever want to ship its
> own policy: I can see use cases for that.
>
> So I agree that we should make it more flexible right from the start.
To achieve per-process granularity, the struct-ops must be embedded
within the mm_struct as follows:
+#ifdef CONFIG_BPF_MM
+struct bpf_mm_ops {
+#ifdef CONFIG_BPF_THP
+ struct bpf_thp_ops bpf_thp;
+#endif
+};
+#endif
+
/*
* Opaque type representing current mm_struct flag state. Must be accessed via
* mm_flags_xxx() helper functions.
@@ -1268,6 +1281,10 @@ struct mm_struct {
#ifdef CONFIG_MM_ID
mm_id_t mm_id;
#endif /* CONFIG_MM_ID */
+
+#ifdef CONFIG_BPF_MM
+ struct bpf_mm_ops bpf_mm;
+#endif
} __randomize_layout;
We should be aware that this will involve extensive changes in mm/. If
we're aligned on this direction, I'll start working on the patches.
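With that in place, the hook could consult the per-mm ops instead of the
global instance; a rough sketch, assuming the bpf_mm field above:

unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
				      enum tva_type type,
				      unsigned long orders)
{
	struct mm_struct *mm = vma->vm_mm;
	thp_order_fn_t *fn;
	int bpf_order;

	rcu_read_lock();
	/* Per-mm ops instead of the global bpf_thp instance. */
	fn = rcu_dereference(mm->bpf_mm.bpf_thp.thp_get_order);
	if (fn) {
		bpf_order = fn(vma, type, orders);
		orders &= BIT(bpf_order);
	}
	rcu_read_unlock();
	return orders;
}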
--
Regards
Yafang
On 08.10.25 10:18, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>> [...]
>>>
>>> I thought I explained it earlier.
>>> Nack to a single global prog approach.
>>
>> I agree. We should have the option to either specify a policy globally,
>> or more refined for cgroups/processes.
>>
>> It's an interesting question if a program would ever want to ship its
>> own policy: I can see use cases for that.
>>
>> So I agree that we should make it more flexible right from the start.
>
> To achieve per-process granularity, the struct-ops must be embedded
> within the mm_struct as follows:
>
> +#ifdef CONFIG_BPF_MM
> +struct bpf_mm_ops {
> +#ifdef CONFIG_BPF_THP
> + struct bpf_thp_ops bpf_thp;
> +#endif
> +};
> +#endif
> +
> /*
> * Opaque type representing current mm_struct flag state. Must be accessed via
> * mm_flags_xxx() helper functions.
> @@ -1268,6 +1281,10 @@ struct mm_struct {
> #ifdef CONFIG_MM_ID
> mm_id_t mm_id;
> #endif /* CONFIG_MM_ID */
> +
> +#ifdef CONFIG_BPF_MM
> + struct bpf_mm_ops bpf_mm;
> +#endif
> } __randomize_layout;
>
> We should be aware that this will involve extensive changes in mm/.
That's what we do on linux-mm :)
It would be great to use Alexei's feedback/experience to come up with
something that is flexible for various use cases.
So I think this is likely the right direction.
It would be great to evaluate which scenarios we could unlock with this
(global vs. per-process vs. per-cgroup) approach, and how
extensive/involved the changes will be.
If we need a slot in the bi-weekly mm alignment session to brainstorm,
we can ask Dave R. for one in the upcoming weeks.
--
Cheers
David / dhildenb
On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 10:18, Yafang Shao wrote:
> > On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >> [...]
> >
> > We should be aware that this will involve extensive changes in mm/.
>
> That's what we do on linux-mm :)
>
> It would be great to use Alexei's feedback/experience to come up with
> something that is flexible for various use cases.
I'm still not entirely convinced that allowing individual processes or
cgroups to run independent progs is a valid use case. However, since
we have a consensus that this is the right direction, I will proceed
with this approach.
>
> So I think this is likely the right direction.
>
> It would be great to evaluate which scenarios we could unlock with this
> (global vs. per-process vs. per-cgroup) approach, and how
> extensive/involved the changes will be.
1. Global Approach
- Pros:
Simple;
Can manage different THP policies for different cgroups or processes.
- Cons:
Does not allow individual processes to run their own BPF programs.
2. Per-Process Approach
- Pros:
Enables each process to run its own BPF program.
- Cons:
Introduces significant complexity, as it requires handling the
BPF program's lifecycle (creation, destruction, inheritance) within
every mm_struct.
3. Per-Cgroup Approach
- Pros:
Allows individual cgroups to run their own BPF programs.
Less complex than the per-process model, as it can leverage the
existing cgroup operations structure.
- Cons:
Creates a dependency on the cgroup subsystem.
Might not be easy to control at the per-process level.
>
> If we need a slot in the bi-weekly mm alignment session to brainstorm,
> we can ask Dave R. for one in the upcoming weeks.
I will draft an RFC to outline the required changes in both the mm/
and bpf/ subsystems and solicit feedback.
--
Regards
Yafang
On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>> [...]
>
> 1. Global Approach
> - Pros:
> Simple;
> Can manage different THP policies for different cgroups or processes.
> - Cons:
> Does not allow individual processes to run their own BPF programs.
>
> 2. Per-Process Approach
> - Pros:
> Enables each process to run its own BPF program.
> - Cons:
> Introduces significant complexity, as it requires handling the
> BPF program's lifecycle (creation, destruction, inheritance) within
> every mm_struct.
>
> 3. Per-Cgroup Approach
> - Pros:
> Allows individual cgroups to run their own BPF programs.
> Less complex than the per-process model, as it can leverage the
> existing cgroup operations structure.
> - Cons:
> Creates a dependency on the cgroup subsystem.
> Might not be easy to control at the per-process level.
Another issue is how, and by whom, hierarchical cgroups should be handled,
where one cgroup is a parent of another. Should the BPF program do that, or
the mm code? I remember hierarchical cgroups are the main reason THP control
at the cgroup level was rejected. If we do per-cgroup BPF control, wouldn't
we get the same rejection from the cgroup folks?
--
Best Regards,
Yan, Zi
On 08.10.25 13:27, Zi Yan wrote:
> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>
>> [...]
>
> Another issue is how, and by whom, hierarchical cgroups should be handled,
> where one cgroup is a parent of another. Should the BPF program do that, or
> the mm code? I remember hierarchical cgroups are the main reason THP control
> at the cgroup level was rejected. If we do per-cgroup BPF control, wouldn't
> we get the same rejection from the cgroup folks?
Valid point.
I do wonder if that problem was already encountered elsewhere with bpf
and if there is already a solution.
Focusing on processes instead of cgroups might be easier initially.
--
Cheers
David / dhildenb
On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 13:27, Zi Yan wrote:
> > [...]
>
> Valid point.
>
> I do wonder if that problem was already encountered elsewhere with bpf
> and if there is already a solution.
Our standard is to run only one instance of a BPF program type
system-wide to avoid conflicts. For example, we can't have both
systemd and a container runtime running bpf-thp simultaneously.
Perhaps Alexei can enlighten us, though we'd need to read between his
characteristically brief lines. ;-)
>
> Focusing on processes instead of cgroups might be easier initially.
--
Regards
Yafang
On 08.10.25 15:11, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>
> Our standard is to run only one instance of a BPF program type
> system-wide to avoid conflicts. For example, we can't have both
> systemd and a container runtime running bpf-thp simultaneously.
Right, it's a good question how to combine policies, or "who wins".
>
> Perhaps Alexei can enlighten us, though we'd need to read between his
> characteristically brief lines. ;-)
There might be some insights to be had in the bpf OOM discussion at
https://lkml.kernel.org/r/CAEf4BzafXv-PstSAP6krers=S74ri1+zTB4Y2oT6f+33yznqsA@mail.gmail.com
I didn't completely read through that, but that discussion also seems to
> be about interaction between cgroups and bpf programs.
--
Cheers
David / dhildenb
On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.10.25 15:11, Yafang Shao wrote:
> > [...]
> > Our standard is to run only one instance of a BPF program type
> > system-wide to avoid conflicts. For example, we can't have both
> > systemd and a container runtime running bpf-thp simultaneously.
>
> Right, it's a good question how to combine policies, or "who wins".
From my perspective, the ideal approach is to have one BPF-THP
instance per mm_struct. This allows for separate managers in different
domains, such as systemd managing BPF-THP for system processes and
containerd for container processes, while ensuring that any single
process is managed by only one BPF-THP.
>
> >
> > Perhaps Alexei can enlighten us, though we'd need to read between his
> > characteristically brief lines. ;-)
>
> There might be some insights to be had in the bpf OOM discussion at
>
> https://lkml.kernel.org/r/CAEf4BzafXv-PstSAP6krers=S74ri1+zTB4Y2oT6f+33yznqsA@mail.gmail.com
>
> I didn't completely read through that, but that discussion also seems to
> be about interaction between cgroups and bpd programs.
I have reviewed the discussions.
Given that the OOM might be cgroup-specific, implementing a
cgroup-based BPF-OOM handler makes sense.
--
Regards
Yafang
On 09.10.25 11:59, Yafang Shao wrote:
> On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> [...]
>>
>> Right, it's a good question how to combine policies, or "who wins".
>
> From my perspective, the ideal approach is to have one BPF-THP
> instance per mm_struct. This allows for separate managers in different
> domains, such as systemd managing BPF-THP for system processes and
> containerd for container processes, while ensuring that any single
> process is managed by only one BPF-THP.
I came to the same conclusion. At least it's a valid start.
Maybe we would later want a global fallback BPF-THP prog if none was
enabled for a specific MM.
But I would expect to start with a per MM way of doing it, it gives you
way more flexibility in the long run.
--
Cheers
David / dhildenb
On Fri, Oct 10, 2025 at 3:54 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 09.10.25 11:59, Yafang Shao wrote:
> > On Thu, Oct 9, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.10.25 15:11, Yafang Shao wrote:
> >>> On Wed, Oct 8, 2025 at 8:07 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 08.10.25 13:27, Zi Yan wrote:
> >>>>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
> >>>>>
> >>>>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>
> >>>>>>> On 08.10.25 10:18, Yafang Shao wrote:
> >>>>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>>>>>>> + enum tva_type type,
> >>>>>>>>>>> + unsigned long orders)
> >>>>>>>>>>> +{
> >>>>>>>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>>>>>>> + int bpf_order;
> >>>>>>>>>>> +
> >>>>>>>>>>> + /* No BPF program is attached */
> >>>>>>>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>>>>>>> + &transparent_hugepage_flags))
> >>>>>>>>>>> + return orders;
> >>>>>>>>>>> +
> >>>>>>>>>>> + rcu_read_lock();
> >>>>>>>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>>>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>>>>>>> + goto out;
> >>>>>>>>>>> +
> >>>>>>>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>>>>>>> + orders &= BIT(bpf_order);
> >>>>>>>>>>> +
> >>>>>>>>>>> +out:
> >>>>>>>>>>> + rcu_read_unlock();
> >>>>>>>>>>> + return orders;
> >>>>>>>>>>> +}
> >>>>>>>>>>
> >>>>>>>>>> I thought I explained it earlier.
> >>>>>>>>>> Nack to a single global prog approach.
> >>>>>>>>>
> >>>>>>>>> I agree. We should have the option to either specify a policy globally,
> >>>>>>>>> or more refined for cgroups/processes.
> >>>>>>>>>
> >>>>>>>>> It's an interesting question if a program would ever want to ship its
> >>>>>>>>> own policy: I can see use cases for that.
> >>>>>>>>>
> >>>>>>>>> So I agree that we should make it more flexible right from the start.
> >>>>>>>>
> >>>>>>>> To achieve per-process granularity, the struct-ops must be embedded
> >>>>>>>> within the mm_struct as follows:
> >>>>>>>>
> >>>>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>>>> +struct bpf_mm_ops {
> >>>>>>>> +#ifdef CONFIG_BPF_THP
> >>>>>>>> + struct bpf_thp_ops bpf_thp;
> >>>>>>>> +#endif
> >>>>>>>> +};
> >>>>>>>> +#endif
> >>>>>>>> +
> >>>>>>>> /*
> >>>>>>>> * Opaque type representing current mm_struct flag state. Must be accessed via
> >>>>>>>> * mm_flags_xxx() helper functions.
> >>>>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>>>>>>> #ifdef CONFIG_MM_ID
> >>>>>>>> mm_id_t mm_id;
> >>>>>>>> #endif /* CONFIG_MM_ID */
> >>>>>>>> +
> >>>>>>>> +#ifdef CONFIG_BPF_MM
> >>>>>>>> + struct bpf_mm_ops bpf_mm;
> >>>>>>>> +#endif
> >>>>>>>> } __randomize_layout;
> >>>>>>>>
> >>>>>>>> We should be aware that this will involve extensive changes in mm/.
> >>>>>>>
> >>>>>>> That's what we do on linux-mm :)
> >>>>>>>
> >>>>>>> It would be great to use Alexei's feedback/experience to come up with
> >>>>>>> something that is flexible for various use cases.
> >>>>>>
> >>>>>> I'm still not entirely convinced that allowing individual processes or
> >>>>>> cgroups to run independent progs is a valid use case. However, since
> >>>>>> we have a consensus that this is the right direction, I will proceed
> >>>>>> with this approach.
> >>>>>>
> >>>>>>>
> >>>>>>> So I think this is likely the right direction.
> >>>>>>>
> >>>>>>> It would be great to evaluate which scenarios we could unlock with this
> >>>>>>> (global vs. per-process vs. per-cgroup) approach, and how
> >>>>>>> extensive/involved the changes will be.
> >>>>>>
> >>>>>> 1. Global Approach
> >>>>>> - Pros:
> >>>>>> Simple;
> >>>>>> Can manage different THP policies for different cgroups or processes.
> >>>>>> - Cons:
> >>>>>> Does not allow individual processes to run their own BPF programs.
> >>>>>>
> >>>>>> 2. Per-Process Approach
> >>>>>> - Pros:
> >>>>>> Enables each process to run its own BPF program.
> >>>>>> - Cons:
> >>>>>> Introduces significant complexity, as it requires handling the
> >>>>>> BPF program's lifecycle (creation, destruction, inheritance) within
> >>>>>> every mm_struct.
> >>>>>>
> >>>>>> 3. Per-Cgroup Approach
> >>>>>> - Pros:
> >>>>>> Allows individual cgroups to run their own BPF programs.
> >>>>>> Less complex than the per-process model, as it can leverage the
> >>>>>> existing cgroup operations structure.
> >>>>>> - Cons:
> >>>>>> Creates a dependency on the cgroup subsystem.
> >>>>>> might not be easy to control at the per-process level.
> >>>>>
> >>>>> Another issue is how and who should deal with hierarchical cgroups,
> >>>>> where one cgroup is a parent of another. Should the bpf program do
> >>>>> that, or the mm code? I remember hierarchical cgroups are the main
> >>>>> reason THP control at the cgroup level was rejected. If we do
> >>>>> per-cgroup bpf control, wouldn't we get the same rejection from
> >>>>> cgroup folks?
> >>>>
> >>>> Valid point.
> >>>>
> >>>> I do wonder if that problem was already encountered elsewhere with bpf
> >>>> and if there is already a solution.
> >>>
> >>> Our standard is to run only one instance of a BPF program type
> >>> system-wide to avoid conflicts. For example, we can't have both
> >>> systemd and a container runtime running bpf-thp simultaneously.
> >>
> >> Right, it's a good question how to combine policies, or "who wins".
> >
> > From my perspective, the ideal approach is to have one BPF-THP
> > instance per mm_struct. This allows for separate managers in different
> > domains, such as systemd managing BPF-THP for system processes and
> > containerd for container processes, while ensuring that any single
> > process is managed by only one BPF-THP.
>
> I came to the same conclusion. At least it's a valid start.
>
> Maybe we would later want a global fallback BPF-THP prog if none was
> enabled for a specific MM.
Good idea. We can fall back to the global model when attaching to pid 1.
>
> But I would expect to start with a per MM way of doing it, it gives you
> way more flexibility in the long run.
Some THP types, such as shmem and file-backed THP, are shareable across
multiple processes and cgroups. If we allow different BPF-THP policies to be
applied to these shared resources, it could lead to policy
inconsistencies. This would ultimately recreate a long-standing issue
in memcg, which still lacks a robust solution for this problem [0].
This suggests that applying SCOPED policies to SHAREABLE memory may be
fundamentally flawed ;-)
[0]. https://lore.kernel.org/linux-mm/YwNold0GMOappUxc@slm.duckdns.org/
(Added the maintainers from the old discussion to this thread.)
--
Regards
Yafang
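
[The per-MM program with a global fallback that David and Yafang converge
on above could look roughly like the following. This is a hedged sketch,
not merged code: the mm->bpf_mm embedding is taken from the struct snippet
quoted earlier in this thread, and the global bpf_thp from the patch
itself.]

/*
 * Sketch only: resolve the hook with per-MM priority and a global
 * fallback, per the discussion above. Assumes the mm->bpf_mm embedding
 * shown earlier in this thread and the global bpf_thp from the patch.
 * Caller is expected to hold rcu_read_lock().
 */
static thp_order_fn_t *bpf_thp_resolve(struct mm_struct *mm)
{
	thp_order_fn_t *fn = NULL;

#ifdef CONFIG_BPF_MM
	/* A program attached to this specific mm_struct wins. */
	fn = rcu_dereference(mm->bpf_mm.bpf_thp.thp_get_order);
#endif
	/* Fall back to the global program, e.g. the one attached to pid 1. */
	if (!fn)
		fn = rcu_dereference(bpf_thp.thp_get_order);
	return fn;
}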
>> I came to the same conclusion. At least it's a valid start.
>>
>> Maybe we would later want a global fallback BPF-THP prog if none was
>> enabled for a specific MM.
>
> Good idea. We can fall back to the global model when attaching to pid 1.
>
>> But I would expect to start with a per MM way of doing it, it gives you
>> way more flexibility in the long run.
>
> Some THP types, such as shmem and file-backed THP, are shareable across
> multiple processes and cgroups. If we allow different BPF-THP policies
> to be applied to these shared resources, it could lead to policy
> inconsistencies.

Sure, but nothing new about that (e.g., VM_HUGEPAGE, VM_NOHUGEPAGE,
PR_GET_THP_DISABLE).

I'd expect that we focus on anon THP as the first step either way.

Skimming over this series, anon memory seems to be the main focus.

> This would ultimately recreate a long-standing issue
> in memcg, which still lacks a robust solution for this problem [0].
>
> This suggests that applying SCOPED policies to SHAREABLE memory may be
> fundamentally flawed ;-)

Yeah, shared memory is usually more tricky: see mempolicy handling for
shmem. There, the policy is glued to a file rather than to a process.

--
Cheers

David / dhildenb
On Mon, Oct 13, 2025 at 8:42 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I came to the same conclusion. At least it's a valid start.
> >>
> >> Maybe we would later want a global fallback BPF-THP prog if none was
> >> enabled for a specific MM.
> >
> > Good idea. We can fall back to the global model when attaching to pid 1.
> >
> >> But I would expect to start with a per MM way of doing it, it gives you
> >> way more flexibility in the long run.
> >
> > Some THP types, such as shmem and file-backed THP, are shareable across
> > multiple processes and cgroups. If we allow different BPF-THP policies
> > to be applied to these shared resources, it could lead to policy
> > inconsistencies.
>
> Sure, but nothing new about that (e.g., VM_HUGEPAGE, VM_NOHUGEPAGE,
> PR_GET_THP_DISABLE).
>
> I'd expect that we focus on anon THP as the first step either way.
>
> Skimming over this series, anon memory seems to be the main focus.

Right, currently it is focusing on anon memory. In the next step it
will be extended to file-backed THP.

>
> > This would ultimately recreate a long-standing issue
> > in memcg, which still lacks a robust solution for this problem [0].
> >
> > This suggests that applying SCOPED policies to SHAREABLE memory may be
> > fundamentally flawed ;-)
>
> Yeah, shared memory is usually more tricky: see mempolicy handling for
> shmem. There, the policy is glued to a file rather than to a process.

For shared THP we are planning to apply the THP policy based on
vma->vm_file. Consequently, the existing BPF-THP policies, which are
scoped to a process or cgroup, are incompatible with shared THP. This
raises the question of how to effectively scope policies for shared
memory. While one option is to key the policy to the file structure,
this may not be ideal as it could lead to considerable implementation
and maintenance challenges...

--
Regards
Yafang
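
[To make the "key the policy to the file" idea concrete, here is a rough
BPF-side sketch. The struct_ops section name and hook signature follow
this series' proposal; the map layout, the direct vma->vm_file->f_inode
access, and the availability of struct bpf_thp_ops in vmlinux.h are
illustrative assumptions, not merged API.]

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, u64);   /* inode number of the mapped file */
	__type(value, int); /* THP order chosen for that file */
} per_file_order SEC(".maps");

SEC("struct_ops/thp_get_order")
int BPF_PROG(thp_get_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	u64 ino;
	int *order;

	/* Anon memory: no file to key on; suggest order 0 here. */
	if (!vma->vm_file)
		return 0;

	/*
	 * Shared memory: every process mapping this file gets the same
	 * answer, sidestepping the per-task inconsistency noted above.
	 */
	ino = vma->vm_file->f_inode->i_ino;
	order = bpf_map_lookup_elem(&per_file_order, &ino);
	return order ? *order : 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_file_policy = {
	.thp_get_order = (void *)thp_get_order,
};

char LICENSE[] SEC("license") = "GPL";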
On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>
> > On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.10.25 10:18, Yafang Shao wrote:
> >>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
> >>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>>>
> >>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> >>>>>> + enum tva_type type,
> >>>>>> + unsigned long orders)
> >>>>>> +{
> >>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
> >>>>>> + int bpf_order;
> >>>>>> +
> >>>>>> + /* No BPF program is attached */
> >>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> >>>>>> + &transparent_hugepage_flags))
> >>>>>> + return orders;
> >>>>>> +
> >>>>>> + rcu_read_lock();
> >>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> >>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> >>>>>> + goto out;
> >>>>>> +
> >>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> >>>>>> + orders &= BIT(bpf_order);
> >>>>>> +
> >>>>>> +out:
> >>>>>> + rcu_read_unlock();
> >>>>>> + return orders;
> >>>>>> +}
> >>>>>
> >>>>> I thought I explained it earlier.
> >>>>> Nack to a single global prog approach.
> >>>>
> >>>> I agree. We should have the option to either specify a policy globally,
> >>>> or more refined for cgroups/processes.
> >>>>
> >>>> It's an interesting question if a program would ever want to ship its
> >>>> own policy: I can see use cases for that.
> >>>>
> >>>> So I agree that we should make it more flexible right from the start.
> >>>
> >>> To achieve per-process granularity, the struct-ops must be embedded
> >>> within the mm_struct as follows:
> >>>
> >>> +#ifdef CONFIG_BPF_MM
> >>> +struct bpf_mm_ops {
> >>> +#ifdef CONFIG_BPF_THP
> >>> + struct bpf_thp_ops bpf_thp;
> >>> +#endif
> >>> +};
> >>> +#endif
> >>> +
> >>> /*
> >>> * Opaque type representing current mm_struct flag state. Must be accessed via
> >>> * mm_flags_xxx() helper functions.
> >>> @@ -1268,6 +1281,10 @@ struct mm_struct {
> >>> #ifdef CONFIG_MM_ID
> >>> mm_id_t mm_id;
> >>> #endif /* CONFIG_MM_ID */
> >>> +
> >>> +#ifdef CONFIG_BPF_MM
> >>> + struct bpf_mm_ops bpf_mm;
> >>> +#endif
> >>> } __randomize_layout;
> >>>
> >>> We should be aware that this will involve extensive changes in mm/.
> >>
> >> That's what we do on linux-mm :)
> >>
> >> It would be great to use Alexei's feedback/experience to come up with
> >> something that is flexible for various use cases.
> >
> > I'm still not entirely convinced that allowing individual processes or
> > cgroups to run independent progs is a valid use case. However, since
> > we have a consensus that this is the right direction, I will proceed
> > with this approach.
> >
> >>
> >> So I think this is likely the right direction.
> >>
> >> It would be great to evaluate which scenarios we could unlock with this
> >> (global vs. per-process vs. per-cgroup) approach, and how
> >> extensive/involved the changes will be.
> >
> > 1. Global Approach
> > - Pros:
> > Simple;
> > Can manage different THP policies for different cgroups or processes.
> > - Cons:
> > Does not allow individual processes to run their own BPF programs.
> >
> > 2. Per-Process Approach
> > - Pros:
> > Enables each process to run its own BPF program.
> > - Cons:
> > Introduces significant complexity, as it requires handling the
> > BPF program's lifecycle (creation, destruction, inheritance) within
> > every mm_struct.
> >
> > 3. Per-Cgroup Approach
> > - Pros:
> > Allows individual cgroups to run their own BPF programs.
> > Less complex than the per-process model, as it can leverage the
> > existing cgroup operations structure.
> > - Cons:
> > Creates a dependency on the cgroup subsystem.
> > Might not be easy to control at the per-process level.
>
> Another issue is how and who should deal with hierarchical cgroups, where
> one cgroup is a parent of another. Should the bpf program do that, or the
> mm code?
The cgroup subsystem handles this propagation automatically. When a
BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
automatically inherited by all descendant cgroups.
Note: struct-ops programs aren't supported by cgroup_bpf_attach(),
requiring us to build new attachment mechanisms for cgroup-based
struct-ops.
> I remember hierarchical cgroups are the main reason THP control
> at the cgroup level was rejected. If we do per-cgroup bpf control,
> wouldn't we get the same rejection from cgroup folks?
Right, it was rejected by the cgroup maintainers [0]
[0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
--
Regards
Yafang
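
[For reference, the inheritance Yafang mentions looks like this from
userspace for an ordinary cgroup program type (struct_ops cannot be
attached this way, as noted above). The cgroup path and attach type are
placeholders for the example.]

/* Illustration only: attach an ordinary cgroup program to a parent cgroup. */
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>

int attach_to_parent_cgroup(int prog_fd)
{
	int cg_fd = open("/sys/fs/cgroup/parent", O_RDONLY); /* assumed path */
	int err;

	if (cg_fd < 0)
		return -1;
	/*
	 * Once attached here, the program becomes part of the effective
	 * program array of every descendant cgroup as well.
	 */
	err = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_EGRESS, 0);
	close(cg_fd);
	return err;
}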
Hi,
On 10/8/2025 3:06 PM, Yafang Shao wrote:
> On Wed, Oct 8, 2025 at 7:27 PM Zi Yan <ziy@nvidia.com> wrote:
>>
>> On 8 Oct 2025, at 5:04, Yafang Shao wrote:
>>
>>> On Wed, Oct 8, 2025 at 4:28 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 08.10.25 10:18, Yafang Shao wrote:
>>>>> On Wed, Oct 8, 2025 at 4:08 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 03.10.25 04:18, Alexei Starovoitov wrote:
>>>>>>> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>>>
>>>>>>>> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
>>>>>>>> + enum tva_type type,
>>>>>>>> + unsigned long orders)
>>>>>>>> +{
>>>>>>>> + thp_order_fn_t *bpf_hook_thp_get_order;
>>>>>>>> + int bpf_order;
>>>>>>>> +
>>>>>>>> + /* No BPF program is attached */
>>>>>>>> + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>>>>>>> + &transparent_hugepage_flags))
>>>>>>>> + return orders;
>>>>>>>> +
>>>>>>>> + rcu_read_lock();
>>>>>>>> + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
>>>>>>>> + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
>>>>>>>> + goto out;
>>>>>>>> +
>>>>>>>> + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
>>>>>>>> + orders &= BIT(bpf_order);
>>>>>>>> +
>>>>>>>> +out:
>>>>>>>> + rcu_read_unlock();
>>>>>>>> + return orders;
>>>>>>>> +}
>>>>>>>
>>>>>>> I thought I explained it earlier.
>>>>>>> Nack to a single global prog approach.
>>>>>>
>>>>>> I agree. We should have the option to either specify a policy globally,
>>>>>> or more refined for cgroups/processes.
>>>>>>
>>>>>> It's an interesting question if a program would ever want to ship its
>>>>>> own policy: I can see use cases for that.
>>>>>>
>>>>>> So I agree that we should make it more flexible right from the start.
>>>>>
>>>>> To achieve per-process granularity, the struct-ops must be embedded
>>>>> within the mm_struct as follows:
>>>>>
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> +struct bpf_mm_ops {
>>>>> +#ifdef CONFIG_BPF_THP
>>>>> + struct bpf_thp_ops bpf_thp;
>>>>> +#endif
>>>>> +};
>>>>> +#endif
>>>>> +
>>>>> /*
>>>>> * Opaque type representing current mm_struct flag state. Must be accessed via
>>>>> * mm_flags_xxx() helper functions.
>>>>> @@ -1268,6 +1281,10 @@ struct mm_struct {
>>>>> #ifdef CONFIG_MM_ID
>>>>> mm_id_t mm_id;
>>>>> #endif /* CONFIG_MM_ID */
>>>>> +
>>>>> +#ifdef CONFIG_BPF_MM
>>>>> + struct bpf_mm_ops bpf_mm;
>>>>> +#endif
>>>>> } __randomize_layout;
>>>>>
>>>>> We should be aware that this will involve extensive changes in mm/.
>>>>
>>>> That's what we do on linux-mm :)
>>>>
>>>> It would be great to use Alexei's feedback/experience to come up with
>>>> something that is flexible for various use cases.
>>>
>>> I'm still not entirely convinced that allowing individual processes or
>>> cgroups to run independent progs is a valid use case. However, since
>>> we have a consensus that this is the right direction, I will proceed
>>> with this approach.
>>>
>>>>
>>>> So I think this is likely the right direction.
>>>>
>>>> It would be great to evaluate which scenarios we could unlock with this
>>>> (global vs. per-process vs. per-cgroup) approach, and how
>>>> extensive/involved the changes will be.
>>>
>>> 1. Global Approach
>>> - Pros:
>>> Simple;
>>> Can manage different THP policies for different cgroups or processes.
>>> - Cons:
>>> Does not allow individual processes to run their own BPF programs.
>>>
>>> 2. Per-Process Approach
>>> - Pros:
>>> Enables each process to run its own BPF program.
>>> - Cons:
>>> Introduces significant complexity, as it requires handling the
>>> BPF program's lifecycle (creation, destruction, inheritance) within
>>> every mm_struct.
>>>
>>> 3. Per-Cgroup Approach
>>> - Pros:
>>> Allows individual cgroups to run their own BPF programs.
>>> Less complex than the per-process model, as it can leverage the
>>> existing cgroup operations structure.
>>> - Cons:
>>> Creates a dependency on the cgroup subsystem.
>>> Might not be easy to control at the per-process level.
>>
>> Another issue is how and who should deal with hierarchical cgroups, where
>> one cgroup is a parent of another. Should the bpf program do that, or the
>> mm code?
>
> The cgroup subsystem handles this propagation automatically. When a
> BPF program is attached to a cgroup via cgroup_bpf_attach(), it's
> automatically inherited by all descendant cgroups.
>
> Note: struct-ops programs aren't supported by cgroup_bpf_attach(),
> requiring us to build new attachment mechanisms for cgroup-based
> struct-ops.
>
>> I remember hierarchical cgroups are the main reason THP control
>> at the cgroup level was rejected. If we do per-cgroup bpf control,
>> wouldn't we get the same rejection from cgroup folks?
>
> Right, it was rejected by the cgroup maintainers [0]
>
> [0]. https://lore.kernel.org/linux-mm/20241030150851.GB706616@cmpxchg.org/
>
Yes, the patch was rejected because:
1. It breaks the cgroup hierarchy when two siblings have different THP
policies.
2. Cgroups were designed for resource management, not for grouping
processes and tuning them.
3. It sets a precedent for other people adding new flags to cgroups,
potentially polluting them. We may end up with cgroups having tens of
different flags, making the sysadmin's job more complex.
In the MM call I proposed a new mechanism based on limits, something like
hugetlbfs.
The main issue, still, is that sysadmins need to set those up, making
their lives more complex.
I remember a few participants mentioned the idea of the kernel setting huge
page consumption automatically using some sort of heuristics. To be honest,
I haven't had the time to sit down and think about it. I would be glad to
cooperate and come up with a feasible solution.
--
Asier Gutierrez
Huawei
On Fri, Oct 3, 2025 at 10:18 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 10:59 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > + enum tva_type type,
> > + unsigned long orders)
> > +{
> > + thp_order_fn_t *bpf_hook_thp_get_order;
> > + int bpf_order;
> > +
> > + /* No BPF program is attached */
> > + if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > + &transparent_hugepage_flags))
> > + return orders;
> > +
> > + rcu_read_lock();
> > + bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > + if (WARN_ON_ONCE(!bpf_hook_thp_get_order))
> > + goto out;
> > +
> > + bpf_order = bpf_hook_thp_get_order(vma, type, orders);
> > + orders &= BIT(bpf_order);
> > +
> > +out:
> > + rcu_read_unlock();
> > + return orders;
> > +}
>
Hello Alexei,
My apologies for the slow reply. I'm on a family vacation and am
checking email intermittently.
> I thought I explained it earlier.
I recall your earlier suggestion for a cgroup-based approach for
BPF-THP. However, as I mentioned, I believe cgroups might not be the
best fit[0]. My understanding was that we had agreed to move away from
that model. Could we realign on this?
[0]. https://lwn.net/ml/all/CALOAHbBvwT+6f_4gBHzPc9n_SukhAs_sa5yX=AjHYsWic1MRuw@mail.gmail.com/
> Nack to a single global prog approach.
The design of BPF-THP as a global program is a direct consequence of
its purpose: to extend the existing global
/sys/kernel/mm/transparent_hugepage/ interface. This architectural
consistency simplifies both understanding and maintenance.
Crucially, this global nature does not limit policy control. The
program is designed with the flexibility to enforce policies at
multiple levels—globally, per-cgroup, or per-task—enabling all of our
target use cases through a unified mechanism.
>
> The logic must accommodate multiple programs per-container
> or any other way from the beginning.
> If cgroup based scoping doesn't fit, use per process tree scoping.
During the initial design of BPF-THP, I evaluated whether a global
program or a per-process program would be more suitable. While a
per-process design would require embedding a struct_ops into
task_struct, this seemed like over-engineering to me. We can
efficiently implement both cgroup-tree-scoped and process-tree-scoped
THP policies using existing BPF helpers, such as:
  SCOPING            BPF kfuncs
  cgroup tree    ->  bpf_task_under_cgroup()
  process tree   ->  bpf_task_is_ancestors()
With these kfuncs, there is no need to attach individual BPF-THP
programs to every process or cgroup tree. I have not identified a
valid use case that necessitates embedding a struct_ops in task_struct
which can't be achieved more simply with these kfuncs. If such use
cases exist, please detail them. Consequently, I proceeded with a
global struct_ops implementation.
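
[As a concrete illustration of the cgroup-tree scoping described above,
here is a hedged sketch of a single global program. The kfunc signatures
are the mainline ones; the struct_ops section name, the hook signature,
and the order-9 constant for PMD-sized pages on x86-64 with 4K pages are
assumptions of this series, not merged API.]

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

u64 target_cgid; /* cgroup ID, set by the loader before attaching */

struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
long bpf_task_under_cgroup(struct task_struct *task,
			   struct cgroup *ancestor) __ksym;
void bpf_cgroup_release(struct cgroup *cgrp) __ksym;

SEC("struct_ops/thp_get_order")
int BPF_PROG(thp_get_order, struct vm_area_struct *vma,
	     enum tva_type type, unsigned long orders)
{
	struct cgroup *cgrp = bpf_cgroup_from_id(target_cgid);
	int order = 0; /* default: suggest order 0 outside the tree */

	if (!cgrp)
		return 0;
	/*
	 * Suggest PMD-sized THP only for tasks under the target cgroup
	 * tree, and only if that order is actually available.
	 */
	if (bpf_task_under_cgroup(bpf_get_current_task_btf(), cgrp) &&
	    (orders & (1UL << 9)))
		order = 9;
	bpf_cgroup_release(cgrp);
	return order;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_cgroup_policy = {
	.thp_get_order = (void *)thp_get_order,
};

char LICENSE[] SEC("license") = "GPL";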
The desire to attach multiple BPF-THP programs simultaneously does not
appear to be a valid use case. Furthermore, our production experience
has shown that multiple attachments often introduce conflicts. This is
precisely why system administrators prefer to manage BPF programs with
a single manager—to avoid undefined behaviors from competing programs.
Focusing specifically on BPF-THP, the semantics of the program make
multiple attachments unsuitable. A BPF-THP program's outcome is its
return value (a suggested THP order), not the side effects of its
execution. In other words, it is functionally a variant of fmod_ret.
If we allow multiple attachments and they return different values, how
do we resolve the conflict?
If one program returns order-9 and another returns order-1, which
value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
appropriate. The only logical solution is to reject subsequent
attachments and explicitly notify the user of the conflict. Our goal
should be to prevent conflicts from the outset, rather than forcing
developers to create another userspace manager to handle them.
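
[A sketch of that reject-and-notify behaviour in the struct_ops
registration path. The flag is the one from the patch quoted earlier in
this thread; the rest of the body is assumed for illustration.]

/* Sketch: first attacher wins; a second attach fails loudly with -EBUSY. */
static int bpf_thp_reg(void *kdata, struct bpf_link *link)
{
	struct bpf_thp_ops *ops = kdata;

	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
			     &transparent_hugepage_flags))
		return -EBUSY;

	rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
	return 0;
}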
A single global program is a natural and logical extension of the
existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
a good fit for BPF-THP and avoids unnecessary complexity.
Please provide a detailed clarification if I have misunderstood your position.
--
Regards
Yafang
On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> has shown that multiple attachments often introduce conflicts. This is
> precisely why system administrators prefer to manage BPF programs with
> a single manager—to avoid undefined behaviors from competing programs.

I don't believe a single bit of this. bpf-thp didn't have any production
exposure. Everything that you said above is wishful thinking.

In actual production every programmable component needs to be scoped in
some way. One can argue that scheduling is a global property too, yet
sched-ext only works on a specific scheduling class.

All bpf program types are scoped except tracing, since kprobe/fentry are
global by definition, and even then multiple tracing programs can be
attached to the same kprobe.

> execution. In other words, it is functionally a variant of fmod_ret.

hid-bpf initially went with the fmod_ret approach, deleted the whole thing
and redesigned it with _scoped_ struct-ops.

> If we allow multiple attachments and they return different values, how
> do we resolve the conflict?
>
> If one program returns order-9 and another returns order-1, which
> value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
> appropriate.

No. If you cannot figure out how to stack multiple programs it means that
the API you picked is broken.

> A single global program is a natural and logical extension of the
> existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
> a good fit for BPF-THP and avoids unnecessary complexity.

The Nack to single global prog is not negotiable.
On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > has shown that multiple attachments often introduce conflicts. This is
> > precisely why system administrators prefer to manage BPF programs with
> > a single manager—to avoid undefined behaviors from competing programs.
>
> I don't believe a single bit of this.

You should spend some time seeing how users are actually applying BPF
in practice. Some information for you:

https://github.com/bpfman/bpfman
https://github.com/DataDog/ebpf-manager
https://github.com/ccfos/huatuo

> bpf-thp didn't have any production exposure.
> Everything that you said above is wishful thinking.

The statement above applies to other multi-attachable programs, not to
bpf-thp.

> In actual production every programmable component needs to be scoped in
> some way. One can argue that scheduling is a global property too, yet
> sched-ext only works on a specific scheduling class.

I can also argue that bpf-thp only works on a specific thp mode
(madvise and always) ;-)

> All bpf program types are scoped except tracing, since kprobe/fentry are
> global by definition, and even then multiple tracing programs can be
> attached to the same kprobe.
>
> > execution. In other words, it is functionally a variant of fmod_ret.
>
> hid-bpf initially went with the fmod_ret approach, deleted the whole thing
> and redesigned it with _scoped_ struct-ops.

I see little value in embedding a bpf_thp_struct_ops into the
task_struct. The benefits don't appear to justify the added
complexity.

> > If we allow multiple attachments and they return different values, how
> > do we resolve the conflict?
> >
> > If one program returns order-9 and another returns order-1, which
> > value should be chosen? Neither 1, 9, nor a combination (1 & 9) is
> > appropriate.
>
> No. If you cannot figure out how to stack multiple programs it means that
> the API you picked is broken.
>
> > A single global program is a natural and logical extension of the
> > existing global /sys/kernel/mm/transparent_hugepage/ interface. It is
> > a good fit for BPF-THP and avoids unnecessary complexity.
>
> The Nack to single global prog is not negotiable.

We still lack a compelling technical reason for embedding
bpf_thp_struct_ops into task_struct. Can you clearly articulate the
problem that this specific design is solving?

--
Regards
Yafang
On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > has shown that multiple attachments often introduce conflicts. This is
> > > precisely why system administrators prefer to manage BPF programs with
> > > a single manager—to avoid undefined behaviors from competing programs.
> >
> > I don't believe a single bit of this.
>
> You should spend some time seeing how users are actually applying BPF
> in practice. Some information for you:
>
> https://github.com/bpfman/bpfman
> https://github.com/DataDog/ebpf-manager
> https://github.com/ccfos/huatuo

By seeing the above you learned the wrong lesson.
These orchestrators and many others were created because
we made mistakes in the kernel by not scoping the progs enough.
XDP is a prime example. It allows one program per netdev.
This was a massive mistake which we're still trying to fix.

> > hid-bpf initially went with the fmod_ret approach, deleted the whole thing
> > and redesigned it with _scoped_ struct-ops.
>
> I see little value in embedding a bpf_thp_struct_ops into the
> task_struct. The benefits don't appear to justify the added
> complexity.

huh? where did I say that struct-ops should be embedded in task_struct?
On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > has shown that multiple attachments often introduce conflicts. This is
> > > > precisely why system administrators prefer to manage BPF programs with
> > > > a single manager—to avoid undefined behaviors from competing programs.
> > >
> > > I don't believe a single bit of this.
> >
> > You should spend some time seeing how users are actually applying BPF
> > in practice. Some information for you:
> >
> > https://github.com/bpfman/bpfman
> > https://github.com/DataDog/ebpf-manager
> > https://github.com/ccfos/huatuo
>
> By seeing the above you learned the wrong lesson.
> These orchestrators and many others were created because
> we made mistakes in the kernel by not scoping the progs enough.
> XDP is a prime example. It allows one program per netdev.
> This was a massive mistake which we're still trying to fix.

Since we don't use XDP in production, I can't comment on it. However,
for our multi-attachable cgroup BPF programs, a key issue arises: if a
program has permission to attach to one cgroup, it can attach to any
cgroup. While scoping enables attachment to individual cgroups, it
does not enforce isolation. This means we must still check for
conflicts between programs, which raises the question: what is the
functional purpose of this scoping mechanism?

> > hid-bpf initially went with the fmod_ret approach, deleted the whole thing
> > and redesigned it with _scoped_ struct-ops.
> >
> > I see little value in embedding a bpf_thp_struct_ops into the
> > task_struct. The benefits don't appear to justify the added
> > complexity.
>
> huh? where did I say that struct-ops should be embedded in task_struct?

Given that, what would you propose? My position is that the only valid
scope for bpf-thp is at the level of specific THP modes like madvise
and always. This patch correctly implements that precise design.

--
Regards
Yafang
On Tue, Oct 7, 2025 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > has shown that multiple attachments often introduce conflicts. This is
> > > > > precisely why system administrators prefer to manage BPF programs with
> > > > > a single manager—to avoid undefined behaviors from competing programs.
> > > >
> > > > I don't believe a single bit of this.
> > >
> > > You should spend some time seeing how users are actually applying BPF
> > > in practice. Some information for you:
> > >
> > > https://github.com/bpfman/bpfman
> > > https://github.com/DataDog/ebpf-manager
> > > https://github.com/ccfos/huatuo
> >
> > By seeing the above you learned the wrong lesson.
> > These orchestrators and many others were created because
> > we made mistakes in the kernel by not scoping the progs enough.
> > XDP is a prime example. It allows one program per netdev.
> > This was a massive mistake which we're still trying to fix.
>
> Since we don't use XDP in production, I can't comment on it. However,
> for our multi-attachable cgroup BPF programs, a key issue arises: if a
> program has permission to attach to one cgroup, it can attach to any
> cgroup. While scoping enables attachment to individual cgroups, it
> does not enforce isolation. This means we must still check for
> conflicts between programs, which raises the question: what is the
> functional purpose of this scoping mechanism?

cgroup mprog was added to remove the need for an orchestrator.

> My position is that the only valid scope for bpf-thp is at the level
> of specific THP modes like madvise and always. This patch correctly
> implements that precise design.

I'm done with this thread.

Nacked-by: Alexei Starovoitov <ast@kernel.org>
On Wed, Oct 8, 2025 at 12:39 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Oct 7, 2025 at 9:25 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Wed, Oct 8, 2025 at 12:10 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Oct 7, 2025 at 8:51 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > On Wed, Oct 8, 2025 at 11:25 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Tue, Oct 7, 2025 at 1:47 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > > > has shown that multiple attachments often introduce conflicts. This is
> > > > > > precisely why system administrators prefer to manage BPF programs with
> > > > > > a single manager—to avoid undefined behaviors from competing programs.
> > > > >
> > > > > I don't believe a single bit of this.
> > > >
> > > > You should spend some time seeing how users are actually applying BPF
> > > > in practice. Some information for you:
> > > >
> > > > https://github.com/bpfman/bpfman
> > > > https://github.com/DataDog/ebpf-manager
> > > > https://github.com/ccfos/huatuo
> > >
> > > By seeing the above you learned the wrong lesson.
> > > These orchestrators and many others were created because
> > > we made mistakes in the kernel by not scoping the progs enough.
> > > XDP is a prime example. It allows one program per netdev.
> > > This was a massive mistake which we're still trying to fix.
> >
> > Since we don't use XDP in production, I can't comment on it. However,
> > for our multi-attachable cgroup BPF programs, a key issue arises: if a
> > program has permission to attach to one cgroup, it can attach to any
> > cgroup. While scoping enables attachment to individual cgroups, it
> > does not enforce isolation. This means we must still check for
> > conflicts between programs, which raises the question: what is the
> > functional purpose of this scoping mechanism?
>
> cgroup mprog was added to remove the need for an orchestrator.

However, this approach would still require a userspace manager to
coordinate the mprog attachments and prevent conflicts between
different programs, no?

> > My position is that the only valid scope for bpf-thp is at the level
> > of specific THP modes like madvise and always. This patch correctly
> > implements that precise design.
>
> I'm done with this thread.
>
> Nacked-by: Alexei Starovoitov <ast@kernel.org>

Given its experimental status, I believe any scoping mechanism would
be premature and over-engineered. Even integrating it into the
mm_struct introduces unnecessary complexity at this stage.

--
Regards
Yafang