[v8] Pass external callchain entry to get_perf_callchain

[PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Tao Chen 1 week, 5 days ago

Record rctx inside the perf_callchain_entry itself, when callers of
get_callchain_entry no longer care about the assignment of rctx, and
will be used in the next patch.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
---
 include/linux/perf_event.h |  5 +++--
 kernel/bpf/stackmap.c      |  5 ++---
 kernel/events/callchain.c  | 27 ++++++++++++++++-----------
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9870d768db4..f0489843ebc 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -57,6 +57,7 @@
 #include <asm/local.h>
 
 struct perf_callchain_entry {
+	int				rctx;
 	u64				nr;
 	u64				ip[]; /* /proc/sys/kernel/perf_event_max_stack */
 };
@@ -1723,8 +1724,8 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
-extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
-extern void put_callchain_entry(int rctx);
+extern struct perf_callchain_entry *get_callchain_entry(void);
+extern void put_callchain_entry(struct perf_callchain_entry *entry);
 
 extern int sysctl_perf_event_max_stack;
 extern int sysctl_perf_event_max_contexts_per_stack;
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index da3d328f5c1..e77dcdc2164 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -214,9 +214,8 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
 {
 #ifdef CONFIG_STACKTRACE
 	struct perf_callchain_entry *entry;
-	int rctx;
 
-	entry = get_callchain_entry(&rctx);
+	entry = get_callchain_entry();
 
 	if (!entry)
 		return NULL;
@@ -238,7 +237,7 @@ get_callchain_entry_for_task(struct task_struct *task, u32 max_depth)
 			to[i] = (u64)(from[i]);
 	}
 
-	put_callchain_entry(rctx);
+	put_callchain_entry(entry);
 
 	return entry;
 #else /* CONFIG_STACKTRACE */
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index b9c7e00725d..6cdbc5937b1 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -151,31 +151,36 @@ void put_callchain_buffers(void)
 	}
 }
 
-struct perf_callchain_entry *get_callchain_entry(int *rctx)
+struct perf_callchain_entry *get_callchain_entry(void)
 {
 	int cpu;
+	int rctx;
 	struct callchain_cpus_entries *entries;
+	struct perf_callchain_entry *entry;
 
-	*rctx = get_recursion_context(this_cpu_ptr(callchain_recursion));
-	if (*rctx == -1)
+	rctx = get_recursion_context(this_cpu_ptr(callchain_recursion));
+	if (rctx == -1)
 		return NULL;
 
 	entries = rcu_dereference(callchain_cpus_entries);
 	if (!entries) {
-		put_recursion_context(this_cpu_ptr(callchain_recursion), *rctx);
+		put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
 		return NULL;
 	}
 
 	cpu = smp_processor_id();
 
-	return (((void *)entries->cpu_entries[cpu]) +
-		(*rctx * perf_callchain_entry__sizeof()));
+	entry = ((void *)entries->cpu_entries[cpu]) +
+		(rctx * perf_callchain_entry__sizeof());
+	entry->rctx = rctx;
+
+	return entry;
 }
 
 void
-put_callchain_entry(int rctx)
+put_callchain_entry(struct perf_callchain_entry *entry)
 {
-	put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
+	put_recursion_context(this_cpu_ptr(callchain_recursion), entry->rctx);
 }
 
 static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entry,
@@ -222,13 +227,13 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
-	int rctx, start_entry_idx;
+	int start_entry_idx;
 
 	/* crosstask is not supported for user stacks */
 	if (crosstask && user && !kernel)
 		return NULL;
 
-	entry = get_callchain_entry(&rctx);
+	entry = get_callchain_entry();
 	if (!entry)
 		return NULL;
 
@@ -272,7 +277,7 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 	}
 
 exit_put:
-	put_callchain_entry(rctx);
+	put_callchain_entry(entry);
 
 	return entry;
 }
-- 
2.48.1

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Peter Zijlstra 1 week, 3 days ago

On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
> Record rctx inside the perf_callchain_entry itself, when callers of
> get_callchain_entry no longer care about the assignment of rctx, and
> will be used in the next patch.

Sorry, what?

The recursion context is very much about the caller, and very much not
about the data. This just doesn't make sense.

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Tao Chen 1 week, 2 days ago

在 2026/1/28 16:59, Peter Zijlstra 写道:
> On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
>> Record rctx inside the perf_callchain_entry itself, when callers of
>> get_callchain_entry no longer care about the assignment of rctx, and
>> will be used in the next patch.
> 
> Sorry, what?
> 
> The recursion context is very much about the caller, and very much not
> about the data. This just doesn't make sense.

Well, Andrii, it seems we have to go back to the original way.

-- 
Best Regards
Tao Chen

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Andrii Nakryiko 1 week, 2 days ago

On Wed, Jan 28, 2026 at 8:53 AM Tao Chen <chen.dylane@linux.dev> wrote:
>
> 在 2026/1/28 16:59, Peter Zijlstra 写道:
> > On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
> >> Record rctx inside the perf_callchain_entry itself, when callers of
> >> get_callchain_entry no longer care about the assignment of rctx, and
> >> will be used in the next patch.
> >
> > Sorry, what?
> >
> > The recursion context is very much about the caller, and very much not
> > about the data. This just doesn't make sense.
>
> Well, Andrii, it seems we have to go back to the original way.

Huh, why? Peter is confused by your wording, he didn't say that what
you are doing is broken.

The point is to couple rctx with the exact perf_callchain_entry
returned to the caller in that recursion context. There is no
contradiction.

It's purely about simplifying the interface. While previously the
caller would have to juggle perf_callchain_entry and rctx separately
(but still ensure that entry and rctx do match each other when calling
put_callchain_entry), now it's more convenient and less error-prone
because returned entry will record rctx it was retrieved with.

That perf_callchain_entry should not be reused by anyone else between
successful get_callchain_entry and put_callchain_entry, so there is no
problem here.

>
> --
> Best Regards
> Tao Chen

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Tao Chen 1 week, 2 days ago

在 2026/1/29 02:59, Andrii Nakryiko 写道:
> On Wed, Jan 28, 2026 at 8:53 AM Tao Chen <chen.dylane@linux.dev> wrote:
>>
>> 在 2026/1/28 16:59, Peter Zijlstra 写道:
>>> On Mon, Jan 26, 2026 at 03:43:29PM +0800, Tao Chen wrote:
>>>> Record rctx inside the perf_callchain_entry itself, when callers of
>>>> get_callchain_entry no longer care about the assignment of rctx, and
>>>> will be used in the next patch.
>>>
>>> Sorry, what?
>>>
>>> The recursion context is very much about the caller, and very much not
>>> about the data. This just doesn't make sense.
>>
>> Well, Andrii, it seems we have to go back to the original way.
> 
> Huh, why? Peter is confused by your wording, he didn't say that what
> you are doing is broken.
>

I see, will update the changelog, thanks.
> The point is to couple rctx with the exact perf_callchain_entry
> returned to the caller in that recursion context. There is no
> contradiction.
> 
> It's purely about simplifying the interface. While previously the
> caller would have to juggle perf_callchain_entry and rctx separately
> (but still ensure that entry and rctx do match each other when calling
> put_callchain_entry), now it's more convenient and less error-prone
> because returned entry will record rctx it was retrieved with.
> 
> That perf_callchain_entry should not be reused by anyone else between
> successful get_callchain_entry and put_callchain_entry, so there is no
> problem here.
> 
>>
>> --
>> Best Regards
>> Tao Chen


-- 
Best Regards
Tao Chen

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by bot+bpf-ci@kernel.org 1 week, 5 days ago

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 9ded2e582c60..0330f265173d 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -57,6 +57,7 @@
>  #include <asm/local.h>
>
>  struct perf_callchain_entry {
> +	int				rctx;
>  	u64				nr;
>  	u64				ip[]; /* /proc/sys/kernel/perf_event_max_stack */
>  };

Adding rctx before nr changes the structure layout in a way that
appears to break the perf userspace ABI.

In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN
is set, the code does:

    int size = 1;
    size += data->callchain->nr;
    size *= sizeof(u64);
    __output_copy(handle, data->callchain, size);

This copies (1 + nr) * 8 bytes starting from the beginning of the
structure. Before this patch, offset 0 contained nr (u64), followed
by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes
padding, then nr at offset 8, then ip[] at offset 16.

With the new layout, this copy sends the following to userspace:
- First 8 bytes: rctx + padding (kernel-internal data) instead of nr
- Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1]

This means the callchain data sent to userspace is shifted by 8 bytes,
the last IP is truncated, and the rctx value is leaked to userspace.

Is there a plan to update perf_output_sample() to output starting from
&entry->nr instead of the structure start? Or am I missing something
about how this structure is used for userspace output?

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Tao Chen 1 week, 5 days ago

在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道:
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 9ded2e582c60..0330f265173d 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -57,6 +57,7 @@
>>   #include <asm/local.h>
>>
>>   struct perf_callchain_entry {
>> +	int				rctx;
>>   	u64				nr;
>>   	u64				ip[]; /* /proc/sys/kernel/perf_event_max_stack */
>>   };
> 
> Adding rctx before nr changes the structure layout in a way that
> appears to break the perf userspace ABI.
> 
> In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN
> is set, the code does:
> 
>      int size = 1;
>      size += data->callchain->nr;
>      size *= sizeof(u64);
>      __output_copy(handle, data->callchain, size);
> 
> This copies (1 + nr) * 8 bytes starting from the beginning of the
> structure. Before this patch, offset 0 contained nr (u64), followed
> by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes
> padding, then nr at offset 8, then ip[] at offset 16.
> 
> With the new layout, this copy sends the following to userspace:
> - First 8 bytes: rctx + padding (kernel-internal data) instead of nr
> - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1]
> 
> This means the callchain data sent to userspace is shifted by 8 bytes,
> the last IP is truncated, and the rctx value is leaked to userspace.
> 
> Is there a plan to update perf_output_sample() to output starting from
> &entry->nr instead of the structure start? Or am I missing something
> about how this structure is used for userspace output?
> 

As AI said, this change shifts the memory layout, which effectively 
breaks the userspace ABI.

Maybe we can use __output_copy(handle, &data->callchain->nr, size);

> [ ... ]
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831


-- 
Best Regards
Tao Chen

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Andrii Nakryiko 1 week, 3 days ago

On Mon, Jan 26, 2026 at 12:51 AM Tao Chen <chen.dylane@linux.dev> wrote:
>
> 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道:
> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> >> index 9ded2e582c60..0330f265173d 100644
> >> --- a/include/linux/perf_event.h
> >> +++ b/include/linux/perf_event.h
> >> @@ -57,6 +57,7 @@
> >>   #include <asm/local.h>
> >>
> >>   struct perf_callchain_entry {
> >> +    int                             rctx;
> >>      u64                             nr;
> >>      u64                             ip[]; /* /proc/sys/kernel/perf_event_max_stack */
> >>   };
> >
> > Adding rctx before nr changes the structure layout in a way that
> > appears to break the perf userspace ABI.
> >
> > In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN
> > is set, the code does:
> >
> >      int size = 1;
> >      size += data->callchain->nr;
> >      size *= sizeof(u64);
> >      __output_copy(handle, data->callchain, size);
> >
> > This copies (1 + nr) * 8 bytes starting from the beginning of the
> > structure. Before this patch, offset 0 contained nr (u64), followed
> > by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes
> > padding, then nr at offset 8, then ip[] at offset 16.
> >
> > With the new layout, this copy sends the following to userspace:
> > - First 8 bytes: rctx + padding (kernel-internal data) instead of nr
> > - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1]
> >
> > This means the callchain data sent to userspace is shifted by 8 bytes,
> > the last IP is truncated, and the rctx value is leaked to userspace.
> >
> > Is there a plan to update perf_output_sample() to output starting from
> > &entry->nr instead of the structure start? Or am I missing something
> > about how this structure is used for userspace output?
> >
>
> As AI said, this change shifts the memory layout, which effectively
> breaks the userspace ABI.
>
> Maybe we can use __output_copy(handle, &data->callchain->nr, size);

yep, very impressive for AI to notice this. I agree that
&data->callchain->nr seems like the best way forward.

>
> > [ ... ]
> >
> >
> > ---
> > AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> >
> > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831
>
>
> --
> Best Regards
> Tao Chen

Re: [PATCH bpf-next v8 1/3] perf: Add rctx in perf_callchain_entry

Posted by Tao Chen 1 week, 3 days ago

在 2026/1/28 05:01, Andrii Nakryiko 写道:
> On Mon, Jan 26, 2026 at 12:51 AM Tao Chen <chen.dylane@linux.dev> wrote:
>>
>> 在 2026/1/26 16:03, bot+bpf-ci@kernel.org 写道:
>>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>>> index 9ded2e582c60..0330f265173d 100644
>>>> --- a/include/linux/perf_event.h
>>>> +++ b/include/linux/perf_event.h
>>>> @@ -57,6 +57,7 @@
>>>>    #include <asm/local.h>
>>>>
>>>>    struct perf_callchain_entry {
>>>> +    int                             rctx;
>>>>       u64                             nr;
>>>>       u64                             ip[]; /* /proc/sys/kernel/perf_event_max_stack */
>>>>    };
>>>
>>> Adding rctx before nr changes the structure layout in a way that
>>> appears to break the perf userspace ABI.
>>>
>>> In perf_output_sample() in kernel/events/core.c, when PERF_SAMPLE_CALLCHAIN
>>> is set, the code does:
>>>
>>>       int size = 1;
>>>       size += data->callchain->nr;
>>>       size *= sizeof(u64);
>>>       __output_copy(handle, data->callchain, size);
>>>
>>> This copies (1 + nr) * 8 bytes starting from the beginning of the
>>> structure. Before this patch, offset 0 contained nr (u64), followed
>>> by ip[]. After this patch, offset 0 contains rctx (int) + 4 bytes
>>> padding, then nr at offset 8, then ip[] at offset 16.
>>>
>>> With the new layout, this copy sends the following to userspace:
>>> - First 8 bytes: rctx + padding (kernel-internal data) instead of nr
>>> - Remaining bytes: nr + ip[0..nr-2] instead of ip[0..nr-1]
>>>
>>> This means the callchain data sent to userspace is shifted by 8 bytes,
>>> the last IP is truncated, and the rctx value is leaked to userspace.
>>>
>>> Is there a plan to update perf_output_sample() to output starting from
>>> &entry->nr instead of the structure start? Or am I missing something
>>> about how this structure is used for userspace output?
>>>
>>
>> As AI said, this change shifts the memory layout, which effectively
>> breaks the userspace ABI.
>>
>> Maybe we can use __output_copy(handle, &data->callchain->nr, size);
> 
> yep, very impressive for AI to notice this. I agree that
> &data->callchain->nr seems like the best way forward.
>

will fix it in v9.
>>
>>> [ ... ]
>>>
>>>
>>> ---
>>> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
>>> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>>>
>>> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21350234831
>>
>>
>> --
>> Best Regards
>> Tao Chen


-- 
Best Regards
Tao Chen