Introduce a BPF kfunc to get a trusted pointer to the root memory
cgroup. It's very handy for traversing the full memcg tree, e.g.
for handling a system-wide OOM.

It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes BPF programs
more complex and less efficient.

bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics,
however in reality it's not necessary to bump the corresponding
reference counter: the root memory cgroup is immortal and reference
counting is skipped for it, see css_get(). Once set, root_mem_cgroup
is always a valid memcg pointer. It's safe to call bpf_put_mem_cgroup()
on the pointer obtained with bpf_get_root_mem_cgroup(); it's
effectively a no-op.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/bpf_memcontrol.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 82eb95de77b7..187919eb2fe2 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -10,6 +10,25 @@

 __bpf_kfunc_start_defs();

+/**
+ * bpf_get_root_mem_cgroup - Returns a pointer to the root memory cgroup
+ *
+ * The function has KF_ACQUIRE semantics, even though the root memory
+ * cgroup is never destroyed after being created and doesn't require
+ * reference counting. It's perfectly safe to pass it to
+ * bpf_put_mem_cgroup().
+ *
+ * Return: The root memory cgroup, or NULL if memcg is disabled.
+ */
+__bpf_kfunc struct mem_cgroup *bpf_get_root_mem_cgroup(void)
+{
+	if (mem_cgroup_disabled())
+		return NULL;
+
+	/* css_get() is not needed */
+	return root_mem_cgroup;
+}
+
 /**
  * bpf_get_mem_cgroup - Get a reference to a memory cgroup
  * @css: pointer to the css structure
@@ -64,6 +83,7 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
 __bpf_kfunc_end_defs();

 BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
+BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
 BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
--
2.52.0
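
The claim that get/put are effectively no-ops for the root can be illustrated with a small user-space model. This is only a sketch: struct model_css, its no_ref field (standing in for the kernel's CSS_NO_REF flag) and the model_* helpers are hypothetical names made up for illustration, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a css; not the kernel's cgroup_subsys_state. */
struct model_css {
	bool no_ref;	/* models CSS_NO_REF, which is set for the root */
	long refcnt;	/* models the reference counter */
};

/* Take a reference unless the object is marked as not refcounted. */
static void model_css_get(struct model_css *css)
{
	if (!css->no_ref)
		css->refcnt++;
}

/* Drop a reference unless the object is marked as not refcounted. */
static void model_css_put(struct model_css *css)
{
	if (!css->no_ref)
		css->refcnt--;
}

/*
 * Do one get/put pair and report how much the counter moved while the
 * reference was held. The counter must be back at its start value.
 */
static long model_get_put_delta(struct model_css *css)
{
	long before = css->refcnt;
	long during;

	model_css_get(css);
	during = css->refcnt;
	model_css_put(css);
	assert(css->refcnt == before);
	return during - before;
}
```

In this model a get/put pair on the root never touches the counter, while the same pair on an ordinary css moves it by one and back. Callers can therefore treat both the same way.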
On Mon, Dec 22, 2025 at 08:41:53PM -0800, Roman Gushchin wrote:
> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
I feel as though relying on KF_ACQUIRE semantics here is somewhat
odd. Users of this BPF kfunc will now be forced to call
bpf_put_mem_cgroup() on the returned root_mem_cgroup, despite it being
completely unnecessary.

Perhaps we should consider introducing a new KF bit/value which
essentially allows such BPF kfuncs to also have their returned
pointers implicitly marked as "trusted", similar to that of the legacy
RET_PTR_TO_BTF_ID_TRUSTED. What do you think? That way it obviates the
requirement to call into any backing KF_RELEASE BPF kfunc after the
fact.
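
The calling pattern under discussion can be sketched roughly as below. To keep it compilable outside the kernel, the two kfuncs are replaced by user-space stubs with the same names, and stub_outstanding is a hypothetical counter that loosely models the verifier's tracking of unreleased references. handle_system_oom() is likewise a made-up caller. A real BPF program would take struct mem_cgroup from vmlinux.h and declare the kfuncs as externs instead.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder type; a real BPF program would get this from vmlinux.h. */
struct mem_cgroup { int unused; };

/* Stub state (assumption): one fake root and a count of held references. */
static struct mem_cgroup stub_root;
static int stub_outstanding;

/* User-space stub standing in for the KF_ACQUIRE | KF_RET_NULL kfunc. */
static struct mem_cgroup *bpf_get_root_mem_cgroup(void)
{
	stub_outstanding++;
	return &stub_root;
}

/* User-space stub standing in for the KF_RELEASE kfunc. */
static void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
{
	(void)memcg;
	stub_outstanding--;
}

/* Hypothetical caller showing the shape the flags impose. */
static int handle_system_oom(void)
{
	struct mem_cgroup *root = bpf_get_root_mem_cgroup();

	/* KF_RET_NULL: the result must be NULL-checked before use. */
	if (!root)
		return 0;

	/* ... walk the memcg tree starting from root ... */

	/* KF_ACQUIRE: a matching release is required, even for the root. */
	bpf_put_mem_cgroup(root);
	return 1;
}
```

The release call is the part being questioned here: it is needed to satisfy the acquire/release pairing, not because the root memcg could go away.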
Matt Bobrowski <mattbobrowski@google.com> writes:
> On Mon, Dec 22, 2025 at 08:41:53PM -0800, Roman Gushchin wrote:
>> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
>
> I feel as though relying on KF_ACQUIRE semantics here is somewhat
> odd. Users of this BPF kfunc will now be forced to call
> bpf_put_mem_cgroup() on the returned root_mem_cgroup, despite it being
> completely unnecessary.
I agree that it's annoying, but I doubt this extra call makes any
difference in the real world.

Also, the corresponding kernel code is designed to hide the special
handling of the root cgroup. css_get()/css_put() are simple no-ops for
the root cgroup, but are totally valid. So in most places the root
cgroup is handled as any other, which simplifies the code. I guess
the same will be true for many bpf programs.

Thanks!
On Tue, Dec 30, 2025 at 09:00:28PM +0000, Roman Gushchin wrote:
> Matt Bobrowski <mattbobrowski@google.com> writes:
>
> > On Mon, Dec 22, 2025 at 08:41:53PM -0800, Roman Gushchin wrote:
> >> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
> >
> > I feel as though relying on KF_ACQUIRE semantics here is somewhat
> > odd. Users of this BPF kfunc will now be forced to call
> > bpf_put_mem_cgroup() on the returned root_mem_cgroup, despite it being
> > completely unnecessary.
>
> I agree that it's annoying, but I doubt this extra call makes any
> difference in the real world.
Sure, that certainly holds true.
> Also, the corresponding kernel code is designed to hide the special
> handling of the root cgroup. css_get()/css_put() are simple no-ops for
> the root cgroup, but are totally valid.
Yes, I do see that.
> So in most places the root cgroup is handled as any other, which
> simplifies the code. I guess the same will be true for many bpf
> programs.
I see, however the same might not necessarily hold for all other
global pointers which end up being handed out by a BPF kfunc (not
necessarily bpf_get_root_mem_cgroup()). This is why I was wondering
whether there's some sense to introducing another KF flag (or
something similar) which allows returned values from BPF kfuncs to be
implicitly treated as trusted.
On Tue, Dec 30, 2025 at 11:42 PM Matt Bobrowski
<mattbobrowski@google.com> wrote:
>
> On Tue, Dec 30, 2025 at 09:00:28PM +0000, Roman Gushchin wrote:
> > Matt Bobrowski <mattbobrowski@google.com> writes:
> >
> > > On Mon, Dec 22, 2025 at 08:41:53PM -0800, Roman Gushchin wrote:
> > >> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
> > >
> > > I feel as though relying on KF_ACQUIRE semantics here is somewhat
> > > odd. Users of this BPF kfunc will now be forced to call
> > > bpf_put_mem_cgroup() on the returned root_mem_cgroup, despite it being
> > > completely unnecessary.
> >
> > I agree that it's annoying, but I doubt this extra call makes any
> > difference in the real world.
>
> Sure, that certainly holds true.
>
> > Also, the corresponding kernel code is designed to hide the special
> > handling of the root cgroup. css_get()/css_put() are simple no-ops for
> > the root cgroup, but are totally valid.
>
> Yes, I do see that.
>
> > So in most places the root cgroup is handled as any other, which
> > simplifies the code. I guess the same will be true for many bpf
> > programs.
>
> I see, however the same might not necessarily hold for all other
> global pointers which end up being handed out by a BPF kfunc (not
> necessarily bpf_get_root_mem_cgroup()). This is why I was wondering
> whether there's some sense to introducing another KF flag (or
> something similar) which allows returned values from BPF kfuncs to be
> implicitly treated as trusted.
No need for a new KF flag. Any struct returned by kfunc should be
trusted or trusted_or_null if KF_RET_NULL was specified.
I don't remember off the top of my head, but this behavior
is already implemented or we discussed making it this way.
On Wed, Dec 31, 2025 at 09:32:17AM -0800, Alexei Starovoitov wrote:
> On Tue, Dec 30, 2025 at 11:42 PM Matt Bobrowski
> <mattbobrowski@google.com> wrote:
> >
> > On Tue, Dec 30, 2025 at 09:00:28PM +0000, Roman Gushchin wrote:
> > > Matt Bobrowski <mattbobrowski@google.com> writes:
> > >
> > > > On Mon, Dec 22, 2025 at 08:41:53PM -0800, Roman Gushchin wrote:
> > > >> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
> > > >
> > > > I feel as though relying on KF_ACQUIRE semantics here is somewhat
> > > > odd. Users of this BPF kfunc will now be forced to call
> > > > bpf_put_mem_cgroup() on the returned root_mem_cgroup, despite it being
> > > > completely unnecessary.
> > >
> > > I agree that it's annoying, but I doubt this extra call makes any
> > > difference in the real world.
> >
> > Sure, that certainly holds true.
> >
> > > Also, the corresponding kernel code is designed to hide the special
> > > handling of the root cgroup. css_get()/css_put() are simple no-ops for
> > > the root cgroup, but are totally valid.
> >
> > Yes, I do see that.
> >
> > > So in most places the root cgroup is handled as any other, which
> > > simplifies the code. I guess the same will be true for many bpf
> > > programs.
> >
> > I see, however the same might not necessarily hold for all other
> > global pointers which end up being handed out by a BPF kfunc (not
> > necessarily bpf_get_root_mem_cgroup()). This is why I was wondering
> > whether there's some sense to introducing another KF flag (or
> > something similar) which allows returned values from BPF kfuncs to be
> > implicitly treated as trusted.
>
> No need for a new KF flag. Any struct returned by kfunc should be
> trusted or trusted_or_null if KF_RET_NULL was specified.
> I don't remember off the top of my head, but this behavior
> is already implemented or we discussed making it this way.
Hm, I do not see any evidence of this kind of semantic currently
implemented, so perhaps it was only discussed at some point. Would you
like me to put forward a patch that introduces this kind of implicit
trust semantic for BPF kfuncs returning pointer to struct types?
On Sun, Jan 4, 2026 at 11:49 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> >
> > No need for a new KF flag. Any struct returned by kfunc should be
> > trusted or trusted_or_null if KF_RET_NULL was specified.
> > I don't remember off the top of my head, but this behavior
> > is already implemented or we discussed making it this way.
>
> Hm, I do not see any evidence of this kind of semantic currently
> implemented, so perhaps it was only discussed at some point. Would you
> like me to put forward a patch that introduces this kind of implicit
> trust semantic for BPF kfuncs returning pointer to struct types?

Hmm. What about these:
BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_RET_NULL)
BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RET_NULL | KF_RCU_PROTECTED)

I thought they're returning a trusted pointer without acquiring it.
iirc the last one returns trusted in RCU CS,
but the first two return just a legacy ptr_to_btf_id ?
This is something to fix asap then.
On Mon, Jan 05, 2026 at 08:05:54AM -0800, Alexei Starovoitov wrote:
> On Sun, Jan 4, 2026 at 11:49 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > >
> > > No need for a new KF flag. Any struct returned by kfunc should be
> > > trusted or trusted_or_null if KF_RET_NULL was specified.
> > > I don't remember off the top of my head, but this behavior
> > > is already implemented or we discussed making it this way.
> >
> > Hm, I do not see any evidence of this kind of semantic currently
> > implemented, so perhaps it was only discussed at some point. Would you
> > like me to put forward a patch that introduces this kind of implicit
> > trust semantic for BPF kfuncs returning pointer to struct types?
>
> Hmm. What about these:
> BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
> BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_RET_NULL)
> BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RET_NULL | KF_RCU_PROTECTED)
>
> I thought they're returning a trusted pointer without acquiring it.
> iirc the last one returns trusted in RCU CS,
> but the first two return just a legacy ptr_to_btf_id ?
> This is something to fix asap then.

No, AFAIU they do not. These simply return a regular pointer to BTF ID
(PTR_TO_BTF_ID), rather than a formally "trusted" pointer (which would
carry the PTR_TRUSTED flag or a ref_obj_id). scx_bpf_cpu_curr returns
a MEM_RCU pointer (via KF_RCU_PROTECTED), which is somewhat considered
to be trusted within a RCU read-side critical section *ONLY*.

Kumar/Tejun,

Please keep me honest here.
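
The distinction drawn in this message can be written down as a toy classifier. It is a simplified model of the description above only, not the verifier's actual logic: struct model_reg, the MODEL_* flags and model_is_trusted() are hypothetical names that merely echo PTR_TRUSTED, MEM_RCU and ref_obj_id.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical type flags; they only echo the verifier's flag names. */
enum model_flag {
	MODEL_PTR_TRUSTED = 1 << 0,
	MODEL_MEM_RCU     = 1 << 1,
};

/* Hypothetical, heavily reduced view of a verifier register. */
struct model_reg {
	unsigned int flags;		/* MODEL_* type flags */
	unsigned int ref_obj_id;	/* non-zero for an acquired reference */
};

/*
 * Classify a pointer as described in the thread:
 *  - an acquired reference (ref_obj_id set) is trusted,
 *  - a pointer carrying the trusted flag is trusted,
 *  - an RCU pointer counts only inside an RCU read-side critical section,
 *  - a plain PTR_TO_BTF_ID-like pointer (no flags, no reference) is not.
 */
static bool model_is_trusted(const struct model_reg *reg, bool in_rcu_cs)
{
	if (reg->ref_obj_id)
		return true;
	if (reg->flags & MODEL_PTR_TRUSTED)
		return true;
	if ((reg->flags & MODEL_MEM_RCU) && in_rcu_cs)
		return true;
	return false;
}
```

Under this model, a pointer returned by a kfunc without KF_ACQUIRE falls into the last case. That is why such a pointer cannot be passed on to kfuncs that expect trusted arguments.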
On Mon, 5 Jan 2026 at 22:04, Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> On Mon, Jan 05, 2026 at 08:05:54AM -0800, Alexei Starovoitov wrote:
> > On Sun, Jan 4, 2026 at 11:49 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > >
> > > >
> > > > No need for a new KF flag. Any struct returned by kfunc should be
> > > > trusted or trusted_or_null if KF_RET_NULL was specified.
> > > > I don't remember off the top of my head, but this behavior
> > > > is already implemented or we discussed making it this way.
> > >
> > > Hm, I do not see any evidence of this kind of semantic currently
> > > implemented, so perhaps it was only discussed at some point. Would you
> > > like me to put forward a patch that introduces this kind of implicit
> > > trust semantic for BPF kfuncs returning pointer to struct types?
> >
> > Hmm. What about these:
> > BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
> > BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_RET_NULL)
> > BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RET_NULL | KF_RCU_PROTECTED)
> >
> > I thought they're returning a trusted pointer without acquiring it.
> > iirc the last one returns trusted in RCU CS,
> > but the first two return just a legacy ptr_to_btf_id ?
> > This is something to fix asap then.
>
> No, AFAIU they do not. These simply return a regular pointer to BTF ID
> (PTR_TO_BTF_ID), rather than a formally "trusted" pointer (which would
> carry the PTR_TRUSTED flag or a ref_obj_id). scx_bpf_cpu_curr returns
> a MEM_RCU pointer (via KF_RCU_PROTECTED), which is somewhat considered
> to be trusted within a RCU read-side critical section *ONLY*.
>
> Kumar/Tejun,

Yeah, they don't return a trusted pointer. I think it would make sense
to change the behavior here by default. A non-trusted pointer cannot be
passed to kfuncs taking trusted arguments, so hopefully it will only
make things more permissive and doesn't break anything.

>
> Please keep me honest here.
On Tue, Jan 06, 2026 at 04:13:24PM +0100, Kumar Kartikeya Dwivedi wrote:
> On Mon, 5 Jan 2026 at 22:04, Matt Bobrowski <mattbobrowski@google.com> wrote:
> >
> > On Mon, Jan 05, 2026 at 08:05:54AM -0800, Alexei Starovoitov wrote:
> > > On Sun, Jan 4, 2026 at 11:49 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
> > > >
> > > > >
> > > > > No need for a new KF flag. Any struct returned by kfunc should be
> > > > > trusted or trusted_or_null if KF_RET_NULL was specified.
> > > > > I don't remember off the top of my head, but this behavior
> > > > > is already implemented or we discussed making it this way.
> > > >
> > > > Hm, I do not see any evidence of this kind of semantic currently
> > > > implemented, so perhaps it was only discussed at some point. Would you
> > > > like me to put forward a patch that introduces this kind of implicit
> > > > trust semantic for BPF kfuncs returning pointer to struct types?
> > >
> > > Hmm. What about these:
> > > BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
> > > BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_RET_NULL)
> > > BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_RET_NULL | KF_RCU_PROTECTED)
> > >
> > > I thought they're returning a trusted pointer without acquiring it.
> > > iirc the last one returns trusted in RCU CS,
> > > but the first two return just a legacy ptr_to_btf_id ?
> > > This is something to fix asap then.
> >
> > No, AFAIU they do not. These simply return a regular pointer to BTF ID
> > (PTR_TO_BTF_ID), rather than a formally "trusted" pointer (which would
> > carry the PTR_TRUSTED flag or a ref_obj_id). scx_bpf_cpu_curr returns
> > a MEM_RCU pointer (via KF_RCU_PROTECTED), which is somewhat considered
> > to be trusted within a RCU read-side critical section *ONLY*.
> >
> > Kumar/Tejun,
>
> Yeah, they don't return a trusted pointer. I think it would make sense
> to change the behavior here by default.

Thanks for chiming in and confirming this Kumar!

I also agree that any BPF kfunc returning a pointer should be treated
as being implicitly trusted by default. I can't think of any scenario
whereby a BPF kfunc would want to return a pointer that'd
fundamentally be untrusted, but there always could be some
exceptions. Anyway, I will work on this and send something through for
review soon.

> A non-trusted pointer cannot be passed to kfuncs taking trusted
> arguments, so hopefully it will only make things more permissive and
> doesn't break anything.

We can only hope! ;)
Matt Bobrowski <mattbobrowski@google.com> writes:
> On Tue, Dec 30, 2025 at 09:00:28PM +0000, Roman Gushchin wrote:
>> Matt Bobrowski <mattbobrowski@google.com> writes:
>>
>> > On Mon, Dec 22, 2025 at 08:41:53PM -0800, Roman Gushchin wrote:
>> >> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
>> >
>> > I feel as though relying on KF_ACQUIRE semantics here is somewhat
>> > odd. Users of this BPF kfunc will now be forced to call
>> > bpf_put_mem_cgroup() on the returned root_mem_cgroup, despite it being
>> > completely unnecessary.
>>
>> I agree that it's annoying, but I doubt this extra call makes any
>> difference in the real world.
>
> Sure, that certainly holds true.
>
>> Also, the corresponding kernel code is designed to hide the special
>> handling of the root cgroup. css_get()/css_put() are simple no-ops for
>> the root cgroup, but are totally valid.
>
> Yes, I do see that.
>
>> So in most places the root cgroup is handled as any other, which
>> simplifies the code. I guess the same will be true for many bpf
>> programs.
>
> I see, however the same might not necessarily hold for all other
> global pointers which end up being handed out by a BPF kfunc (not
> necessarily bpf_get_root_mem_cgroup()). This is why I was wondering
> whether there's some sense to introducing another KF flag (or
> something similar) which allows returned values from BPF kfuncs to be
> implicitly treated as trusted.
Agree. It sounds like a good idea to me.