unwind, perf: sframe user space unwinding, deferred perf callchains

[PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Josh Poimboeuf 1 year, 4 months ago

This is a new version of sframe user space unwinding + perf deferred
callchains.  Better late than never.

Unfortunately I didn't get a chance to do any testing with this one yet
as I'm rushing it out the door before GNU Cauldron starts.
Conference-driven development ;-)

Plus I don't have the perf tool piece working yet so I don't have a way
of doing end-to-end testing at the moment anyway.

In other words, please don't merge this yet.

Namhyung, if you're still available to write a perf tool patch which
integrates with this, that would be great.  Otherwise I could give it a
try.

Steven, let me know if this would interface ok with your anticipated
tracing usage.

v2:
- rebase on v6.11-rc7
- reorganize the patches to add sframe first
- change to sframe v2
- add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED
- add new perf attribute: defer_callchain

v1: https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org

Some distros have started compiling frame pointers into all their
packages to enable the kernel to do system-wide profiling of user space.
Unfortunately that creates a runtime performance penalty across the
entire system.  Using DWARF (or .eh_frame) instead isn't feasible
because of complexity and slowness.

For in-kernel unwinding we solved this problem with the creation of the
ORC unwinder for x86_64.  Similarly, for user space the GNU assembler
has created the SFrame ("Simple Frame") v2 format starting with binutils
2.41.

These patches add support for unwinding user space from the kernel using
SFrame with perf.  It should be easy to add user unwinding support for
other components like ftrace.

There were two main challenges:

1) Finding .sframe sections in shared/dlopened libraries

   The kernel has no visibility to the contents of shared libraries.
   This was solved by adding a PR_ADD_SFRAME option to prctl() which
   allows the runtime linker to manually provide the in-memory address
   of an .sframe section to the kernel.

2) Dealing with page faults

   Keeping all binaries' sframe data pinned would likely waste a lot of
   memory.  Instead, read it from user space on demand.  That can't be
   done from perf NMI context due to page faults, so defer the unwind to
   the next user exit.  Since the NMI handler doesn't do exit work,
   self-IPI and then schedule task work to be run on exit from the IPI.

Special thanks to Indu for the original concept, and to Steven and Peter
for helping a lot with the design.  And to Steven for letting me do it ;-)

Josh Poimboeuf (11):
  unwind: Introduce generic user space unwinding interface
  unwind/x86: Add HAVE_USER_UNWIND
  unwind: Introduce SFrame user space unwinding
  unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME
  perf/x86: Use user_unwind interface
  perf: Remove get_perf_callchain() 'init_nr' argument
  perf: Remove get_perf_callchain() 'crosstask' argument
  perf: Simplify get_perf_callchain() user logic
  perf: Introduce deferred user callchains
  perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED
  perf/x86: Enable SFrame unwinding for deferred user callchains

 arch/Kconfig                       |   9 +
 arch/x86/Kconfig                   |   3 +
 arch/x86/events/core.c             |  76 +++---
 arch/x86/include/asm/user_unwind.h |  11 +
 fs/binfmt_elf.c                    |  47 +++-
 include/linux/mm_types.h           |   3 +
 include/linux/perf_event.h         |  11 +-
 include/linux/sframe.h             |  46 ++++
 include/linux/user_unwind.h        |  32 +++
 include/uapi/linux/elf.h           |   1 +
 include/uapi/linux/perf_event.h    |  21 +-
 include/uapi/linux/prctl.h         |   3 +
 kernel/Makefile                    |   1 +
 kernel/bpf/stackmap.c              |   8 +-
 kernel/events/callchain.c          |  48 ++--
 kernel/events/core.c               |  82 +++++-
 kernel/fork.c                      |  10 +
 kernel/sys.c                       |  11 +
 kernel/unwind/Makefile             |   2 +
 kernel/unwind/sframe.c             | 420 +++++++++++++++++++++++++++++
 kernel/unwind/sframe.h             | 215 +++++++++++++++
 kernel/unwind/user.c               |  95 +++++++
 mm/init-mm.c                       |   4 +-
 23 files changed, 1086 insertions(+), 73 deletions(-)
 create mode 100644 arch/x86/include/asm/user_unwind.h
 create mode 100644 include/linux/sframe.h
 create mode 100644 include/linux/user_unwind.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/sframe.c
 create mode 100644 kernel/unwind/sframe.h
 create mode 100644 kernel/unwind/user.c

-- 
2.46.0

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Namhyung Kim 1 year, 4 months ago

Hello Josh,

On Sat, Sep 14, 2024 at 01:02:02AM +0200, Josh Poimboeuf wrote:
> This is a new version of sframe user space unwinding + perf deferred
> callchains.  Better late than never.
> 
> Unfortunately I didn't get a chance to do any testing with this one yet
> as I'm rushing it out the door before GNU Cauldron starts.
> Conference-driven development ;-)

lol

> 
> Plus I don't have the perf tool piece working yet so I don't have a way
> of doing end-to-end testing at the moment anyway.
> 
> In other words, please don't merge this yet.
> 
> Namhyung, if you're still available to write a perf tool patch which
> integrates with this, that would be great.  Otherwise I could give it a
> try.

Sure, I'll take a look and post the perf tools patches.

Thanks,
Namhyung

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Steven Rostedt 1 year, 4 months ago

On Sat, 14 Sep 2024 01:02:02 +0200
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> This is a new version of sframe user space unwinding + perf deferred
> callchains.  Better late than never.
> 
> Unfortunately I didn't get a chance to do any testing with this one yet
> as I'm rushing it out the door before GNU Cauldron starts.
> Conference-driven development ;-)

Thanks for posting this!

BTW, next time, can you also Cc linux-trace-kernel@vger.kerne.org.

> 
> Plus I don't have the perf tool piece working yet so I don't have a way
> of doing end-to-end testing at the moment anyway.
> 
> In other words, please don't merge this yet.
> 
> Namhyung, if you're still available to write a perf tool patch which
> integrates with this, that would be great.  Otherwise I could give it a
> try.
> 
> Steven, let me know if this would interface ok with your anticipated
> tracing usage.

This is a very good starting point for me. A couple of notes.

I think the unwinder should have an interface itself that provides the
deferred unwinding, instead of having all the tracers to implement
their own.

The user space unwinder (regardless of sframe or not) should provide a
way to say "I want a user space stack trace for here, but don't do it
yet. Just give me a cookie and then call my callback function with the
stack and cookie before going back to user space".

That is, we should have an interface like:

typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);

struct unwinder_client {
	struct list_head	list;
	unwinder_callback_t	callback;
};

struct unwinder_client my_unwinder = {
	.callback = mycallback;
};

	register_unwinder(&my_unwinder);

Then a tracer could do this in NMI context:

	/* Get the current cookie of the current task */
	cookie = unwinder_get_cookie();

	/* Tell the unwinder to call the callbacks */
	unwinder_trigger();

	/* then the tracer can save off that cookie, do a kernel stack trace
	 * and save the cookie with that stack trace.
	 */

On return to user space, it will go the ptrace slow path and call all
registered callbacks with the cookie, regardless if it asked for it or
not.

	/* Save the stack somewhere, could even allocate memory to do so */
	list_for_each_entry(callback, &unwinder_callback, list) {
		callback->callback(userstack, cookie);
	}
	update_next_cookie();

The cookie is a unique number for every entry into the kernel by all
tasks. Then it is up to the tracer's callback to know if it should save
the trace or ignore it. That is, if perf and ftrace are expecting
callbacks, but only perf triggered the callback, it should record the
cookie, and save the stack if it matches a cookie it wants. ftrace's
callback would ignore the trace, because it did not ask for a callback.
That is, the tracer will need to save off the cookie to know if its
callback needs to do something or not.

The cookie can be generated by this:

static u64 next_cookie;

u64 unwinder_get_cookie(void)
{
	u64 next;

	if (current->unwinder_cookie)
		return current->unwinder_cookie;

	next = next_cookie;
	while (!try_cmpxchg64(&next_cookie, &next, next + 1))
		;

	next++;

	/* Check for races with NMI */
	if (!cmpxchg(&current->unwinder_cookie, 0, next))
		next = current->unwinder_cookie;

	return next;
}

When the task goes back to user space and does all the callbacks, it
would reset the task cookie.

	current->unwinder_cookie = 0;

Now the loop of callbacks should also catch if unwinder_trigger() was
called again (NMI), and may need to recall all the back traces again
with the new cookie.

That will need some thought about handling that race.

This is just an idea to start discussion.

-- Steve

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Josh Poimboeuf 1 year, 4 months ago

On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> I think the unwinder should have an interface itself that provides the
> deferred unwinding, instead of having all the tracers to implement
> their own.
> 
> The user space unwinder (regardless of sframe or not) should provide a
> way to say "I want a user space stack trace for here, but don't do it
> yet. Just give me a cookie and then call my callback function with the
> stack and cookie before going back to user space".

We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
I think we're in basic agreement on this.

I think the biggest tweak we decided on is that the context id (aka
"cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
incremented for every entry from user space.

> That is, we should have an interface like:
> 
> typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);

> struct unwinder_client {
> 	struct list_head	list;
> 	unwinder_callback_t	callback;

I assume we want to allow tracers to pick sframes or FP (or auto).  If
so we would need to add a user_unwind_type enum to this struct.

Then the question is, what to do if tracer A wants sframe and tracer B
wants FP?  I'm thinking it's fine to allow that.  I assume the "multiple
tracers unwinding user space" case isn't realistic so any extra overhead
from these cases is the user's fault?

The unwinder would need two sets of callbacks and task work functions:
one for sframe, one for FP.

The tracer would need to pass its &my_unwinder struct to
unwinder_trigger() so the unwinder knows which task work function to
activate.

I thinking we also need a 'max_entries' field.  No need to keep
unwinding if we've already reached the tracer's max.  If there are
multiple callbacks then we'd have to get the max of the maxes, but
that's easy enough.

Mathieu had requested passing an opaque void * from the NMI to the
callback, but I don't think that's possible with this scheme since the
unwinder wouldn't know which callback to give it to.  Instead the tracer
would have to keep track of its own data associated with the given
cookie.

-- 
Josh

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Peter Zijlstra 1 year, 4 months ago

On Sun, Sep 15, 2024 at 01:11:11PM +0200, Josh Poimboeuf wrote:
> On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> > I think the unwinder should have an interface itself that provides the
> > deferred unwinding, instead of having all the tracers to implement
> > their own.
> > 
> > The user space unwinder (regardless of sframe or not) should provide a
> > way to say "I want a user space stack trace for here, but don't do it
> > yet. Just give me a cookie and then call my callback function with the
> > stack and cookie before going back to user space".
> 
> We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
> I think we're in basic agreement on this.
> 
> I think the biggest tweak we decided on is that the context id (aka
> "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> incremented for every entry from user space.

Why? What's the purpose of the cookie? This scheme seems unsound, pin
yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.

> > That is, we should have an interface like:
> > 
> > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);

Just make it a void* and let the consumer figure it out.

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Steven Rostedt 1 year, 4 months ago

On Mon, 16 Sep 2024 16:08:56 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > 
> > I think the biggest tweak we decided on is that the context id (aka
> > "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> > incremented for every entry from user space.  
> 
> Why? What's the purpose of the cookie? This scheme seems unsound, pin
> yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.
> 
> > > That is, we should have an interface like:
> > > 
> > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);  
> 
> Just make it a void* and let the consumer figure it out.

Let me try to explain the rationale for the "cookie". It's actually a "context_cookie".
The cookie is unique for every time user space enters the kernel.

Basically it's a "tag" (we could call it that too, but "cookie" seemed
more appropriate as it follows what web browsers do). The idea is this:

 * You're in interrupt context and want a user space back trace, and
   request to have a user stack trace when the task goes back to user
   space. Now the unwinder will return the current context cookie" and
   will use that same cookie when it creates the user stack on exit
   back to user space.

* If this happens in a long system call that does many stack traces
  (records the kernel stack and also wants the user stack), it does not
  record the user trace at the time it is requested. But we need a way
  to tag this event with the user stack trace that will happen when the
  user stack is recorded when going back to user space. We can't pin
  the task to the CPU when we want to profile it. The cookie/tag needs
  to be the same for every request that happens with a single entry
  into the kernel. It must be different for another entry into the
  kernel as the user stack will be different then.

  Basically think that each cookie represents a single user stack.

* When going back to user space, the ptrace path is taken and the user
  stack trace is recorded. It will then call the tracers that requested
  it with the stack trace and the cookie that represents it.

Here's an example:

system_call() {

	<interrupt> trigger user stack trace: assign cookie 123
		tracer records kernel stack trace and adds cookie 123.

	<interrupt> trigger user stack trace: assign cookie 123
		tracer records kernel stack trace and adds cookie 123.

	<interrupt> trigger user stack trace: assign cookie 123
		tracer records kernel stack trace and adds cookie 123.

	<return to user space, take ptrace path>

	<record user stack trace>
	call all the tracers with user stack trace with cookie 123.
}

system_call() {

	<interrupt> trigger user stack trace: assign cookie 124
		tracer records kernel stack trace and adds cookie 124.

	<return to user space, take ptrace path>

	<record user stack trace>
	call all the tracers with user stack trace with cookie 124.
}

Then the tracers can post process the events and append the user stack
trace with cookie 123 on top of the kernel stack events that had cookie
123, and append the user stack trace with cookie 124 on top of the
kernel stack events that had cookie 124.

-- Steve

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Josh Poimboeuf 1 year, 4 months ago

On Mon, Sep 16, 2024 at 04:08:56PM +0200, Peter Zijlstra wrote:
> On Sun, Sep 15, 2024 at 01:11:11PM +0200, Josh Poimboeuf wrote:
> > On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> > > I think the unwinder should have an interface itself that provides the
> > > deferred unwinding, instead of having all the tracers to implement
> > > their own.
> > > 
> > > The user space unwinder (regardless of sframe or not) should provide a
> > > way to say "I want a user space stack trace for here, but don't do it
> > > yet. Just give me a cookie and then call my callback function with the
> > > stack and cookie before going back to user space".
> > 
> > We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
> > I think we're in basic agreement on this.
> > 
> > I think the biggest tweak we decided on is that the context id (aka
> > "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> > incremented for every entry from user space.
> 
> Why? What's the purpose of the cookie?

The cookie is incremented once per entry from userspace, when needed.

It's a unique id used by the tracers which is emitted with both kernel
and user stacktrace samples so the userspace tool can stitch them back
together.  Even if you have multiple tracers triggering at the same time
they can all share a single user trace.

Completely untested patch here that I hacked up today:

 git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-2.1

To avoid races, unwind_user_deferred() can't be called from NMI.  The
tracers will need to trigger irq_work to call it.

> This scheme seems unsound

It may not be communicated well in this thread.  We had much discussion
at Cauldron the last few days.  The patch may be the best way to
communicate it at this point, though I'm sure I missed some details.

> pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.

Hm???

> > > That is, we should have an interface like:
> > > 
> > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);
> 
> Just make it a void* and let the consumer figure it out.

I think it makes sense to manage the cookie in the deferred unwinder
since the need to correlate the stack traces is an inherent part of the
deferral.

-- 
Josh

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Steven Rostedt 1 year, 4 months ago

On Mon, 16 Sep 2024 17:39:53 +0200
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> Completely untested patch here that I hacked up today:

I can tell it wasn't tested ;-)

> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-2.1
> 
> To avoid races, unwind_user_deferred() can't be called from NMI.  The
> tracers will need to trigger irq_work to call it.

diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 7edb0833fe46..c56cf5d564df 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -177,7 +181,8 @@ int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie)
 
 		cookie = __this_cpu_read(ctx_ctr);
 		cookie &= ((1UL << 48) - 1);
-		cookie |= ((cpu << 48) + 1);
+		cookie |= cpu << 48;
+		cookie++;
 		__this_cpu_write(ctx_ctr, cookie);
 
 		current->unwind_ctx_cookie = cookie;

As the cookie never got incremented.

That was just one issue. Things are still not working but I'll debug the
rest later.

-- Steve

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Josh Poimboeuf 1 year, 4 months ago

On Wed, Oct 02, 2024 at 10:31:25PM -0400, Steven Rostedt wrote:
> +++ b/kernel/unwind/user.c
> @@ -177,7 +181,8 @@ int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie)
>  
>  		cookie = __this_cpu_read(ctx_ctr);
>  		cookie &= ((1UL << 48) - 1);
> -		cookie |= ((cpu << 48) + 1);
> +		cookie |= cpu << 48;
> +		cookie++;
>  		__this_cpu_write(ctx_ctr, cookie);
>  
>  		current->unwind_ctx_cookie = cookie;
> 
> As the cookie never got incremented.

Ha, that might help ;-)

> That was just one issue. Things are still not working but I'll debug the
> rest later.

If you don't want to run on the bleeding edge, I'll be testing it soon
myself once I get the coding done for my v3.

-- 
Josh

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Steven Rostedt 1 year, 4 months ago

On Wed, 2 Oct 2024 19:37:02 -0700
Josh Poimboeuf <jpoimboe@kernel.org> wrote:

> > That was just one issue. Things are still not working but I'll debug the
> > rest later.  
> 
> If you don't want to run on the bleeding edge, I'll be testing it soon
> myself once I get the coding done for my v3.

Yeah, I'll wait for that. I have some code working for ftrace on top of
this, but I also promised Mathieu I would start looking at his faultable
trace points, so I should do that before working more on sframe.

-- Steve

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Peter Zijlstra 1 year, 4 months ago

On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:

> The cookie is incremented once per entry from userspace, when needed.
> 
> It's a unique id used by the tracers which is emitted with both kernel
> and user stacktrace samples so the userspace tool can stitch them back
> together.  Even if you have multiple tracers triggering at the same time
> they can all share a single user trace.

But perf don't need this at all, right? It knows that the next deferred
trace for that tid will be the one.

> > This scheme seems unsound

> > pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.
> 
> Hm???

The point being that it is possible to wrap one CPU into the id space of
another CPU. It is not trivial, but someone who wants to can make it
happen.

Combined I don't see the need to force this into this scheme, carry an
opaque (void*) pointer and let the user do whatever it wants, perf for
instance can pass NULL and not do anything like this.

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Mathieu Desnoyers 1 year, 4 months ago

On 2024-09-16 20:15, Peter Zijlstra wrote:
[...]
> The point being that it is possible to wrap one CPU into the id space of
> another CPU. It is not trivial, but someone who wants to can make it
> happen.

I agree that the overflow of the free-running counter bleeding into the
CPU numbers is something we want to prevent. We don't care if this
counter overflows after day/months/years for sake of correlation
within a system call, but we do care about the fact that this
free-running counter could be made to overflow after a very long
time while the system runs, and then we reach a state where the
CPU numbers are mixed up, which leads to short-term correlation
issues.

I would recommend this layout for this 64-bit value instead:

low-bits: cpu number
high-bits: free-running counter

This way, we eliminate any carry from overflow into the cpu number bits.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Mathieu Desnoyers 1 year, 4 months ago

On 2024-09-16 02:15, Mathieu Desnoyers wrote:
> On 2024-09-16 20:15, Peter Zijlstra wrote:
> [...]
>> The point being that it is possible to wrap one CPU into the id space of
>> another CPU. It is not trivial, but someone who wants to can make it
>> happen.
> 
> I agree that the overflow of the free-running counter bleeding into the
> CPU numbers is something we want to prevent. We don't care if this
> counter overflows after day/months/years for sake of correlation
> within a system call, but we do care about the fact that this
> free-running counter could be made to overflow after a very long
> time while the system runs, and then we reach a state where the
> CPU numbers are mixed up, which leads to short-term correlation
> issues.
> 
> I would recommend this layout for this 64-bit value instead:
> 
> low-bits: cpu number
> high-bits: free-running counter
> 
> This way, we eliminate any carry from overflow into the cpu number bits.

Even better: AFAIR from the discussion I had with Steven and Josh, we intend
to have the cookie stored to/read from the task struct with interrupts off,
we can simply do:

struct stackwalk_cookie {
	uint64_t counter;	/* free running per-cpu counter value */
	int cpu;		/* cpu on which the counter was snapshot. */
};

So we don't have to deal with any overflow trickiness, there is no need for
bit-shifting, and we get a full 64-bit free-running counter independently of
the number of CPUs.

Thanks,

Mathieu



> 
> Thanks,
> 
> Mathieu
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Mathieu Desnoyers 1 year, 4 months ago

On 2024-09-16 02:33, Mathieu Desnoyers wrote:
> On 2024-09-16 02:15, Mathieu Desnoyers wrote:
>> On 2024-09-16 20:15, Peter Zijlstra wrote:
>> [...]
>>> The point being that it is possible to wrap one CPU into the id space of
>>> another CPU. It is not trivial, but someone who wants to can make it
>>> happen.
>>
>> I agree that the overflow of the free-running counter bleeding into the
>> CPU numbers is something we want to prevent. We don't care if this
>> counter overflows after day/months/years for sake of correlation
>> within a system call, but we do care about the fact that this
>> free-running counter could be made to overflow after a very long
>> time while the system runs, and then we reach a state where the
>> CPU numbers are mixed up, which leads to short-term correlation
>> issues.
>>
>> I would recommend this layout for this 64-bit value instead:
>>
>> low-bits: cpu number
>> high-bits: free-running counter
>>
>> This way, we eliminate any carry from overflow into the cpu number bits.
> 
> Even better: AFAIR from the discussion I had with Steven and Josh, we 
> intend
> to have the cookie stored to/read from the task struct with interrupts off,
> we can simply do:
> 
> struct stackwalk_cookie {
>      uint64_t counter;    /* free running per-cpu counter value */
>      int cpu;        /* cpu on which the counter was snapshot. */
> };
> 
> So we don't have to deal with any overflow trickiness, there is no need for
> bit-shifting, and we get a full 64-bit free-running counter 
> independently of
> the number of CPUs.

Sorry my laptop had the wrong date when I sent this email. Re-sending to
make sure it gets seen in the correct order time-wise.

Thanks,

Mathieu

> 
> Thanks,
> 
> Mathieu
> 
> 
> 
>>
>> Thanks,
>>
>> Mathieu
>>
>>
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Steven Rostedt 1 year, 4 months ago

On Mon, 16 Sep 2024 20:15:45 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:
> 
> > The cookie is incremented once per entry from userspace, when needed.
> > 
> > It's a unique id used by the tracers which is emitted with both kernel
> > and user stacktrace samples so the userspace tool can stitch them back
> > together.  Even if you have multiple tracers triggering at the same time
> > they can all share a single user trace.  
> 
> But perf don't need this at all, right? It knows that the next deferred
> trace for that tid will be the one.

Is that because perf uses per task buffers? Will perf know this if it
uses a global buffer? What does perf do with "-a"?

> 
> > > This scheme seems unsound  
> 
> > > pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle.  
> > 
> > Hm???  
> 
> The point being that it is possible to wrap one CPU into the id space of
> another CPU. It is not trivial, but someone who wants to can make it
> happen.
> 
> Combined I don't see the need to force this into this scheme, carry an
> opaque (void*) pointer and let the user do whatever it wants, perf for
> instance can pass NULL and not do anything like this.

That would put a huge burden on those tracers. Specifically ftrace.
Just because perf doesn't need it, it doesn't mean it can't be part of
the infrastructure. The cookie isn't even per-cpu. The counter is just
generated per cpu to avoid races. The cookie will have the CPU as part
of the bits, plus the per-cpu counter. This will give us a unique value
across all CPUs without having to do any atomic operations. It's just
that the cookie must be unique for every time a task enters the kernel,
and stay the same while it is in the kernel.

So if the task enters the kernel on CPU 1 and a trace is requested, the
tracer still needs to save that event when the request occurred (for
example, the kernel stack trace). It needs a unique identifier that
will be attached to the user stack trace when it is made, so that post
processing can tell where to append that stack.

Then the task can migrate to several other CPUs, but it will still have
the same cookie as when it received it on the first CPU. The cookie
only gets updated on the task when it goes back to user space.

ftrace allows for several instances that could be filtering on the same
or different tasks. One instance could filter 3 tasks, another 2 tasks,
where one of those tasks is in both instance. The user stack
infrastructure should give a way to easily map the request to the stack
via some kind of identifier.

What would perf do on dropped events? Say a system call happened, perf
requested a user stack trace, then the events were dropped, and when it
started again, it was in a system call, how would perf know it is in
the same system call? I guess the callback when the dropped user stack
trace occurred would tell perf to reset. Still it adds a pretty big
burden for the tracers to handle these corner cases whereas a simple
identifier is much more simple and robust.

-- Steve

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Namhyung Kim 1 year, 4 months ago

Hello,

On Mon, Sep 16, 2024 at 06:46:45PM -0400, Steven Rostedt wrote:
> On Mon, 16 Sep 2024 20:15:45 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:
> > 
> > > The cookie is incremented once per entry from userspace, when needed.
> > > 
> > > It's a unique id used by the tracers which is emitted with both kernel
> > > and user stacktrace samples so the userspace tool can stitch them back
> > > together.  Even if you have multiple tracers triggering at the same time
> > > they can all share a single user trace.  
> > 
> > But perf don't need this at all, right? It knows that the next deferred
> > trace for that tid will be the one.

Well.. technically you can sample without tid.  But I'm not sure how
much it'd be useful if you collect callchains without tid.

> 
> Is that because perf uses per task buffers? Will perf know this if it
> uses a global buffer? What does perf do with "-a"?

Then it'd use per-cpu ring buffers.  But each sample would have pid/tid
pair and time so perf tools can match it with a deferred callchian.

Thanks,
Namhyung

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Mathieu Desnoyers 1 year, 4 months ago

On 2024-09-17 23:58, Namhyung Kim wrote:
> Hello,
> 
> On Mon, Sep 16, 2024 at 06:46:45PM -0400, Steven Rostedt wrote:
>> On Mon, 16 Sep 2024 20:15:45 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote:
>>>
>>>> The cookie is incremented once per entry from userspace, when needed.
>>>>
>>>> It's a unique id used by the tracers which is emitted with both kernel
>>>> and user stacktrace samples so the userspace tool can stitch them back
>>>> together.  Even if you have multiple tracers triggering at the same time
>>>> they can all share a single user trace.
>>>
>>> But perf don't need this at all, right? It knows that the next deferred
>>> trace for that tid will be the one.
> 
> Well.. technically you can sample without tid.  But I'm not sure how
> much it'd be useful if you collect callchains without tid.
> 
>>
>> Is that because perf uses per task buffers? Will perf know this if it
>> uses a global buffer? What does perf do with "-a"?
> 
> Then it'd use per-cpu ring buffers.  But each sample would have pid/tid
> pair and time so perf tools can match it with a deferred callchian.

That semantic correlation based on trace information should work fine if
you do not miss important events in the trace. What the unique id cookie
provides is robustness against confusion that can arise when events are
discarded.

Discarded events happen when the event throughput is too high for the
ring buffer to handle. The following type of confusion can then arise:

- If you miss a stack trace event and then a stack-sample-request
event, then the post-processing tools can incorrectly infer causality
between a prior stack-sample-request and a following stack trace for
the same tid.

- If you miss even more information about the end/beginning of lifetime
of two threads, post-processing can be confused and associate a prior
stack-sample-request from tid=N with a later stack trace for tid=N where
N was re-used due to an exit/clone sequence.

Saving the unique id cookie along with the stack-sample-request and with
the stack trace allows more robust causality relationship between the
two. Showing reliable causality information may not be super important
for profiling use-cases, but it is important for event tracers.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Steven Rostedt 1 year, 4 months ago

On Sun, 15 Sep 2024 13:11:11 +0200
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote:
> > I think the unwinder should have an interface itself that provides the
> > deferred unwinding, instead of having all the tracers to implement
> > their own.
> > 
> > The user space unwinder (regardless of sframe or not) should provide a
> > way to say "I want a user space stack trace for here, but don't do it
> > yet. Just give me a cookie and then call my callback function with the
> > stack and cookie before going back to user space".  
> 
> We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and
> I think we're in basic agreement on this.
> 
> I think the biggest tweak we decided on is that the context id (aka
> "cookie") would be percpu.  Its initial value is (cpuid << 48).  It gets
> incremented for every entry from user space.

Well, in concept it is incremented for every entry from user space, but
in reality it should only be incremented the first time it is
referenced when coming into user space. That is, if there's nothing
asking for a stack trace, then nothing should be incremented.

> 
> > That is, we should have an interface like:
> > 
> > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie);  
> 
> > struct unwinder_client {
> > 	struct list_head	list;
> > 	unwinder_callback_t	callback;  
> 
> I assume we want to allow tracers to pick sframes or FP (or auto).  If
> so we would need to add a user_unwind_type enum to this struct.

Actually, I don't think they should care. A stack trace should just be
passed to the callback. How that stack trace is created, is up to the
unwinder. sframe should be used if it's avaliable, otherwise it will
use the frame pointer. If neither is available, the callback could be
called with NULL for the stack trace and then it can figure out what to
do via pt_regs.

> 
> Then the question is, what to do if tracer A wants sframe and tracer B
> wants FP?  I'm thinking it's fine to allow that.  I assume the "multiple
> tracers unwinding user space" case isn't realistic so any extra overhead
> from these cases is the user's fault?
> 
> The unwinder would need two sets of callbacks and task work functions:
> one for sframe, one for FP.
> 
> The tracer would need to pass its &my_unwinder struct to
> unwinder_trigger() so the unwinder knows which task work function to
> activate.
> 
> 
> I thinking we also need a 'max_entries' field.  No need to keep
> unwinding if we've already reached the tracer's max.  If there are
> multiple callbacks then we'd have to get the max of the maxes, but
> that's easy enough.

We could set a max when the tracer registers a callback. It can be -1
to mean "no limit".

I was talking with Mathieu at lunch, and we talked about having the
stack information pointer be part of the task_struct.

That is, the first time the unwinder is triggered and its doing the
unwind (either sfarme or fp), it will allocate memory to store the
stack trace information. This information is what is sent to the
tracer's callbacks. Again, the tracers do not care how the stack frame
was created. This memory is then saved in the task struct (freed when
the task exits), and reused for future stack traces.

If a stack trace could not be created, the callback just gets NULL and
that will tell the tracer that nothing is available for the user space
stack trace, and it can try to do something else. But that's not some
the unwinder should care about.

> 
> 
> Mathieu had requested passing an opaque void * from the NMI to the
> callback, but I don't think that's possible with this scheme since the
> unwinder wouldn't know which callback to give it to.  Instead the tracer
> would have to keep track of its own data associated with the given
> cookie.
> 

Not sure what that means, but from NMI, I would see the tracer doing:

	u64 cookie;

	cookie = undwinder_trigger();

And then the tracer is responsible to do something with that cookie.
This would also trigger the "task_work()"  where the ptrace path is
entered when going back to user space, the unwinder will try to create a
stack trace (from sframe or fp), saving it in this allocated memory
that is then stored via a pointer on the task_struct so it only needs
to be allocated once (or if a stack trace is bigger than what was
allocated).

Then all the tracer callbacks will be called with the stack trace and
the cookie. The tracer will decided if it should use the stack trace
from the cookie, or just ignore it. But the tracer shouldn't care about
sframes or frame pointers. That's an implementation detail of the
unwinder.

-- Steve

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Jens Remus 1 year, 3 months ago

Hello Josh!

On 14.09.2024 01:02, Josh Poimboeuf wrote:
> This is a new version of sframe user space unwinding + perf deferred
> callchains.  Better late than never.

...

> These patches add support for unwinding user space from the kernel using
> SFrame with perf.  It should be easy to add user unwinding support for
> other components like ftrace.

...

We are looking forward to implement support for unwinding of user space
using SFrame in kernel/perf on s390. One major concern is that your x86
implementation currently relies on a fallback to unwinding using frame
pointer. On s390 unwinding using frame pointer is unsupported, because
of lack of proper s390x ABI [1] specification and compiler support. In
theory there would be a s390-specific alternative of unwinding using
backchain (compiler option -mbackchain), but this has limitations and
there is currently no distribution where user space is built with
backchain.

How much of an issue would it be if s390 could not provide an unwinding
fallback implementation? Do you see the possibility to get away without?

For s390 support of unwinding using SFrame we would need to make a few
changes to your generic unwinding framework in the kernel:

- Support for architectures that do not define CFA == SP at callsite:
   On s390 the CFA is defined as SP at callsite +160. The stack pointer
   (SP) therefore unwinds as SP = CFA - 160. For that we would introduce
   e.g. a sp_val_off field (SP value offset from CFA) in struct
   user_unwind_frame that would default to 0 on all architectures except
   s390.

- Support for architectures where RA is not necessarily saved on stack:
   On s390 the return address (RA) is not saved (on stack) at function
   entry. In leaf functions it is not necessarily saved at all.

- Support for architectures were RA/FP are saved in registers in leaf
   functions:
   On s390 the frame pointer (FP) and return address (RA) registers can
   be saved in other registers when in leaf functions. In the SFrame
   format we would encode the DWARF register number as offset using the
   least-significant bit as indication: offset = (regnum << 1) | 1.
   Therefore we would need to extend your generic unwinding framework to
   support FP and RA to be restored from registers.

[1]: s390x ELF ABI supplement,
      https://github.com/IBM/s390x-abi/releases

Thanks and regards,
Jens
-- 
Jens Remus
Linux on Z Development (D3303) and z/VSE Support
+49-7031-16-1128 Office
jremus@de.ibm.com

IBM

IBM Deutschland Research & Development GmbH; Vorsitzender des 
Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der 
Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294
IBM Data Privacy Statement: https://www.ibm.com/privacy/

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Josh Poimboeuf 1 year, 3 months ago

On Wed, Oct 23, 2024 at 03:22:35PM +0200, Jens Remus wrote:
> We are looking forward to implement support for unwinding of user space
> using SFrame in kernel/perf on s390. One major concern is that your x86
> implementation currently relies on a fallback to unwinding using frame
> pointer. On s390 unwinding using frame pointer is unsupported, because
> of lack of proper s390x ABI [1] specification and compiler support. In
> theory there would be a s390-specific alternative of unwinding using
> backchain (compiler option -mbackchain), but this has limitations and
> there is currently no distribution where user space is built with
> backchain.
> 
> How much of an issue would it be if s390 could not provide an unwinding
> fallback implementation? Do you see the possibility to get away without?

No problem, I'll make the fallback dependent on CONFIG_HAVE_UNWIND_USER_FP.

> For s390 support of unwinding using SFrame we would need to make a few
> changes to your generic unwinding framework in the kernel:

Also no problem, I'm sure there will need to be tweaks going forward.

Thanks for looking at it!  v3 will be posted soon.

-- 
Josh

Re: [PATCH v2 00/11] unwind, perf: sframe user space unwinding, deferred perf callchains

Posted by Steven Rostedt 1 year, 3 months ago

On Wed, 23 Oct 2024 15:22:35 +0200
Jens Remus <jremus@linux.ibm.com> wrote:

> We are looking forward to implement support for unwinding of user space
> using SFrame in kernel/perf on s390. One major concern is that your x86
> implementation currently relies on a fallback to unwinding using frame
> pointer. On s390 unwinding using frame pointer is unsupported, because
> of lack of proper s390x ABI [1] specification and compiler support. In
> theory there would be a s390-specific alternative of unwinding using
> backchain (compiler option -mbackchain), but this has limitations and
> there is currently no distribution where user space is built with
> backchain.
> 
> How much of an issue would it be if s390 could not provide an unwinding
> fallback implementation? Do you see the possibility to get away without?

Yes. Even with x86, there's no guarantee that the applications will
have frame pointers available. Basically it just returns a stack frame
of one (the IP of where user space entered the kernel).

> 
> For s390 support of unwinding using SFrame we would need to make a few
> changes to your generic unwinding framework in the kernel:
> 
> - Support for architectures that do not define CFA == SP at callsite:
>    On s390 the CFA is defined as SP at callsite +160. The stack pointer
>    (SP) therefore unwinds as SP = CFA - 160. For that we would introduce
>    e.g. a sp_val_off field (SP value offset from CFA) in struct
>    user_unwind_frame that would default to 0 on all architectures except
>    s390.
> 
> - Support for architectures where RA is not necessarily saved on stack:
>    On s390 the return address (RA) is not saved (on stack) at function
>    entry. In leaf functions it is not necessarily saved at all.
> 
> - Support for architectures were RA/FP are saved in registers in leaf
>    functions:
>    On s390 the frame pointer (FP) and return address (RA) registers can
>    be saved in other registers when in leaf functions. In the SFrame
>    format we would encode the DWARF register number as offset using the
>    least-significant bit as indication: offset = (regnum << 1) | 1.
>    Therefore we would need to extend your generic unwinding framework to
>    support FP and RA to be restored from registers.
> 
> [1]: s390x ELF ABI supplement,
>       https://github.com/IBM/s390x-abi/releases

Note that Indu (who's on the Cc and is also the author of sframes) gave
a talk at GNU Cauldron about s390 support. I'm assuming that her new
sframe specification will cover all of this. Then we will have to
implement it.

-- Steve