arch/Kconfig | 9 + arch/x86/Kconfig | 3 + arch/x86/events/core.c | 76 +++--- arch/x86/include/asm/user_unwind.h | 11 + fs/binfmt_elf.c | 47 +++- include/linux/mm_types.h | 3 + include/linux/perf_event.h | 11 +- include/linux/sframe.h | 46 ++++ include/linux/user_unwind.h | 32 +++ include/uapi/linux/elf.h | 1 + include/uapi/linux/perf_event.h | 21 +- include/uapi/linux/prctl.h | 3 + kernel/Makefile | 1 + kernel/bpf/stackmap.c | 8 +- kernel/events/callchain.c | 48 ++-- kernel/events/core.c | 82 +++++- kernel/fork.c | 10 + kernel/sys.c | 11 + kernel/unwind/Makefile | 2 + kernel/unwind/sframe.c | 420 +++++++++++++++++++++++++++++ kernel/unwind/sframe.h | 215 +++++++++++++++ kernel/unwind/user.c | 95 +++++++ mm/init-mm.c | 4 +- 23 files changed, 1086 insertions(+), 73 deletions(-) create mode 100644 arch/x86/include/asm/user_unwind.h create mode 100644 include/linux/sframe.h create mode 100644 include/linux/user_unwind.h create mode 100644 kernel/unwind/Makefile create mode 100644 kernel/unwind/sframe.c create mode 100644 kernel/unwind/sframe.h create mode 100644 kernel/unwind/user.c
This is a new version of sframe user space unwinding + perf deferred callchains. Better late than never. Unfortunately I didn't get a chance to do any testing with this one yet as I'm rushing it out the door before GNU Cauldron starts. Conference-driven development ;-) Plus I don't have the perf tool piece working yet so I don't have a way of doing end-to-end testing at the moment anyway. In other words, please don't merge this yet. Namhyung, if you're still available to write a perf tool patch which integrates with this, that would be great. Otherwise I could give it a try. Steven, let me know if this would interface ok with your anticipated tracing usage. v2: - rebase on v6.11-rc7 - reorganize the patches to add sframe first - change to sframe v2 - add new perf event type: PERF_RECORD_CALLCHAIN_DEFERRED - add new perf attribute: defer_callchain v1: https://lore.kernel.org/cover.1699487758.git.jpoimboe@kernel.org Some distros have started compiling frame pointers into all their packages to enable the kernel to do system-wide profiling of user space. Unfortunately that creates a runtime performance penalty across the entire system. Using DWARF (or .eh_frame) instead isn't feasible because of complexity and slowness. For in-kernel unwinding we solved this problem with the creation of the ORC unwinder for x86_64. Similarly, for user space the GNU assembler has created the SFrame ("Simple Frame") v2 format starting with binutils 2.41. These patches add support for unwinding user space from the kernel using SFrame with perf. It should be easy to add user unwinding support for other components like ftrace. There were two main challenges: 1) Finding .sframe sections in shared/dlopened libraries The kernel has no visibility to the contents of shared libraries. This was solved by adding a PR_ADD_SFRAME option to prctl() which allows the runtime linker to manually provide the in-memory address of an .sframe section to the kernel. 2) Dealing with page faults Keeping all binaries' sframe data pinned would likely waste a lot of memory. Instead, read it from user space on demand. That can't be done from perf NMI context due to page faults, so defer the unwind to the next user exit. Since the NMI handler doesn't do exit work, self-IPI and then schedule task work to be run on exit from the IPI. Special thanks to Indu for the original concept, and to Steven and Peter for helping a lot with the design. And to Steven for letting me do it ;-) Josh Poimboeuf (11): unwind: Introduce generic user space unwinding interface unwind/x86: Add HAVE_USER_UNWIND unwind: Introduce SFrame user space unwinding unwind/x86/64: Add HAVE_USER_UNWIND_SFRAME perf/x86: Use user_unwind interface perf: Remove get_perf_callchain() 'init_nr' argument perf: Remove get_perf_callchain() 'crosstask' argument perf: Simplify get_perf_callchain() user logic perf: Introduce deferred user callchains perf/x86: Add HAVE_PERF_CALLCHAIN_DEFERRED perf/x86: Enable SFrame unwinding for deferred user callchains arch/Kconfig | 9 + arch/x86/Kconfig | 3 + arch/x86/events/core.c | 76 +++--- arch/x86/include/asm/user_unwind.h | 11 + fs/binfmt_elf.c | 47 +++- include/linux/mm_types.h | 3 + include/linux/perf_event.h | 11 +- include/linux/sframe.h | 46 ++++ include/linux/user_unwind.h | 32 +++ include/uapi/linux/elf.h | 1 + include/uapi/linux/perf_event.h | 21 +- include/uapi/linux/prctl.h | 3 + kernel/Makefile | 1 + kernel/bpf/stackmap.c | 8 +- kernel/events/callchain.c | 48 ++-- kernel/events/core.c | 82 +++++- kernel/fork.c | 10 + kernel/sys.c | 11 + kernel/unwind/Makefile | 2 + kernel/unwind/sframe.c | 420 +++++++++++++++++++++++++++++ kernel/unwind/sframe.h | 215 +++++++++++++++ kernel/unwind/user.c | 95 +++++++ mm/init-mm.c | 4 +- 23 files changed, 1086 insertions(+), 73 deletions(-) create mode 100644 arch/x86/include/asm/user_unwind.h create mode 100644 include/linux/sframe.h create mode 100644 include/linux/user_unwind.h create mode 100644 kernel/unwind/Makefile create mode 100644 kernel/unwind/sframe.c create mode 100644 kernel/unwind/sframe.h create mode 100644 kernel/unwind/user.c -- 2.46.0
Hello Josh, On Sat, Sep 14, 2024 at 01:02:02AM +0200, Josh Poimboeuf wrote: > This is a new version of sframe user space unwinding + perf deferred > callchains. Better late than never. > > Unfortunately I didn't get a chance to do any testing with this one yet > as I'm rushing it out the door before GNU Cauldron starts. > Conference-driven development ;-) lol > > Plus I don't have the perf tool piece working yet so I don't have a way > of doing end-to-end testing at the moment anyway. > > In other words, please don't merge this yet. > > Namhyung, if you're still available to write a perf tool patch which > integrates with this, that would be great. Otherwise I could give it a > try. Sure, I'll take a look and post the perf tools patches. Thanks, Namhyung
On Sat, 14 Sep 2024 01:02:02 +0200 Josh Poimboeuf <jpoimboe@kernel.org> wrote: > This is a new version of sframe user space unwinding + perf deferred > callchains. Better late than never. > > Unfortunately I didn't get a chance to do any testing with this one yet > as I'm rushing it out the door before GNU Cauldron starts. > Conference-driven development ;-) Thanks for posting this! BTW, next time, can you also Cc linux-trace-kernel@vger.kerne.org. > > Plus I don't have the perf tool piece working yet so I don't have a way > of doing end-to-end testing at the moment anyway. > > In other words, please don't merge this yet. > > Namhyung, if you're still available to write a perf tool patch which > integrates with this, that would be great. Otherwise I could give it a > try. > > Steven, let me know if this would interface ok with your anticipated > tracing usage. This is a very good starting point for me. A couple of notes. I think the unwinder should have an interface itself that provides the deferred unwinding, instead of having all the tracers to implement their own. The user space unwinder (regardless of sframe or not) should provide a way to say "I want a user space stack trace for here, but don't do it yet. Just give me a cookie and then call my callback function with the stack and cookie before going back to user space". That is, we should have an interface like: typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie); struct unwinder_client { struct list_head list; unwinder_callback_t callback; }; struct unwinder_client my_unwinder = { .callback = mycallback; }; register_unwinder(&my_unwinder); Then a tracer could do this in NMI context: /* Get the current cookie of the current task */ cookie = unwinder_get_cookie(); /* Tell the unwinder to call the callbacks */ unwinder_trigger(); /* then the tracer can save off that cookie, do a kernel stack trace * and save the cookie with that stack trace. */ On return to user space, it will go the ptrace slow path and call all registered callbacks with the cookie, regardless if it asked for it or not. /* Save the stack somewhere, could even allocate memory to do so */ list_for_each_entry(callback, &unwinder_callback, list) { callback->callback(userstack, cookie); } update_next_cookie(); The cookie is a unique number for every entry into the kernel by all tasks. Then it is up to the tracer's callback to know if it should save the trace or ignore it. That is, if perf and ftrace are expecting callbacks, but only perf triggered the callback, it should record the cookie, and save the stack if it matches a cookie it wants. ftrace's callback would ignore the trace, because it did not ask for a callback. That is, the tracer will need to save off the cookie to know if its callback needs to do something or not. The cookie can be generated by this: static u64 next_cookie; u64 unwinder_get_cookie(void) { u64 next; if (current->unwinder_cookie) return current->unwinder_cookie; next = next_cookie; while (!try_cmpxchg64(&next_cookie, &next, next + 1)) ; next++; /* Check for races with NMI */ if (!cmpxchg(¤t->unwinder_cookie, 0, next)) next = current->unwinder_cookie; return next; } When the task goes back to user space and does all the callbacks, it would reset the task cookie. current->unwinder_cookie = 0; Now the loop of callbacks should also catch if unwinder_trigger() was called again (NMI), and may need to recall all the back traces again with the new cookie. That will need some thought about handling that race. This is just an idea to start discussion. -- Steve
On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote: > I think the unwinder should have an interface itself that provides the > deferred unwinding, instead of having all the tracers to implement > their own. > > The user space unwinder (regardless of sframe or not) should provide a > way to say "I want a user space stack trace for here, but don't do it > yet. Just give me a cookie and then call my callback function with the > stack and cookie before going back to user space". We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and I think we're in basic agreement on this. I think the biggest tweak we decided on is that the context id (aka "cookie") would be percpu. Its initial value is (cpuid << 48). It gets incremented for every entry from user space. > That is, we should have an interface like: > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie); > struct unwinder_client { > struct list_head list; > unwinder_callback_t callback; I assume we want to allow tracers to pick sframes or FP (or auto). If so we would need to add a user_unwind_type enum to this struct. Then the question is, what to do if tracer A wants sframe and tracer B wants FP? I'm thinking it's fine to allow that. I assume the "multiple tracers unwinding user space" case isn't realistic so any extra overhead from these cases is the user's fault? The unwinder would need two sets of callbacks and task work functions: one for sframe, one for FP. The tracer would need to pass its &my_unwinder struct to unwinder_trigger() so the unwinder knows which task work function to activate. I thinking we also need a 'max_entries' field. No need to keep unwinding if we've already reached the tracer's max. If there are multiple callbacks then we'd have to get the max of the maxes, but that's easy enough. Mathieu had requested passing an opaque void * from the NMI to the callback, but I don't think that's possible with this scheme since the unwinder wouldn't know which callback to give it to. Instead the tracer would have to keep track of its own data associated with the given cookie. -- Josh
On Sun, Sep 15, 2024 at 01:11:11PM +0200, Josh Poimboeuf wrote: > On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote: > > I think the unwinder should have an interface itself that provides the > > deferred unwinding, instead of having all the tracers to implement > > their own. > > > > The user space unwinder (regardless of sframe or not) should provide a > > way to say "I want a user space stack trace for here, but don't do it > > yet. Just give me a cookie and then call my callback function with the > > stack and cookie before going back to user space". > > We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and > I think we're in basic agreement on this. > > I think the biggest tweak we decided on is that the context id (aka > "cookie") would be percpu. Its initial value is (cpuid << 48). It gets > incremented for every entry from user space. Why? What's the purpose of the cookie? This scheme seems unsound, pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle. > > That is, we should have an interface like: > > > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie); Just make it a void* and let the consumer figure it out.
On Mon, 16 Sep 2024 16:08:56 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > > > > I think the biggest tweak we decided on is that the context id (aka > > "cookie") would be percpu. Its initial value is (cpuid << 48). It gets > > incremented for every entry from user space. > > Why? What's the purpose of the cookie? This scheme seems unsound, pin > yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle. > > > > That is, we should have an interface like: > > > > > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie); > > Just make it a void* and let the consumer figure it out. Let me try to explain the rationale for the "cookie". It's actually a "context_cookie". The cookie is unique for every time user space enters the kernel. Basically it's a "tag" (we could call it that too, but "cookie" seemed more appropriate as it follows what web browsers do). The idea is this: * You're in interrupt context and want a user space back trace, and request to have a user stack trace when the task goes back to user space. Now the unwinder will return the current context cookie" and will use that same cookie when it creates the user stack on exit back to user space. * If this happens in a long system call that does many stack traces (records the kernel stack and also wants the user stack), it does not record the user trace at the time it is requested. But we need a way to tag this event with the user stack trace that will happen when the user stack is recorded when going back to user space. We can't pin the task to the CPU when we want to profile it. The cookie/tag needs to be the same for every request that happens with a single entry into the kernel. It must be different for another entry into the kernel as the user stack will be different then. Basically think that each cookie represents a single user stack. * When going back to user space, the ptrace path is taken and the user stack trace is recorded. It will then call the tracers that requested it with the stack trace and the cookie that represents it. Here's an example: system_call() { <interrupt> trigger user stack trace: assign cookie 123 tracer records kernel stack trace and adds cookie 123. <interrupt> trigger user stack trace: assign cookie 123 tracer records kernel stack trace and adds cookie 123. <interrupt> trigger user stack trace: assign cookie 123 tracer records kernel stack trace and adds cookie 123. <return to user space, take ptrace path> <record user stack trace> call all the tracers with user stack trace with cookie 123. } system_call() { <interrupt> trigger user stack trace: assign cookie 124 tracer records kernel stack trace and adds cookie 124. <return to user space, take ptrace path> <record user stack trace> call all the tracers with user stack trace with cookie 124. } Then the tracers can post process the events and append the user stack trace with cookie 123 on top of the kernel stack events that had cookie 123, and append the user stack trace with cookie 124 on top of the kernel stack events that had cookie 124. -- Steve
On Mon, Sep 16, 2024 at 04:08:56PM +0200, Peter Zijlstra wrote: > On Sun, Sep 15, 2024 at 01:11:11PM +0200, Josh Poimboeuf wrote: > > On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote: > > > I think the unwinder should have an interface itself that provides the > > > deferred unwinding, instead of having all the tracers to implement > > > their own. > > > > > > The user space unwinder (regardless of sframe or not) should provide a > > > way to say "I want a user space stack trace for here, but don't do it > > > yet. Just give me a cookie and then call my callback function with the > > > stack and cookie before going back to user space". > > > > We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and > > I think we're in basic agreement on this. > > > > I think the biggest tweak we decided on is that the context id (aka > > "cookie") would be percpu. Its initial value is (cpuid << 48). It gets > > incremented for every entry from user space. > > Why? What's the purpose of the cookie? The cookie is incremented once per entry from userspace, when needed. It's a unique id used by the tracers which is emitted with both kernel and user stacktrace samples so the userspace tool can stitch them back together. Even if you have multiple tracers triggering at the same time they can all share a single user trace. Completely untested patch here that I hacked up today: git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-2.1 To avoid races, unwind_user_deferred() can't be called from NMI. The tracers will need to trigger irq_work to call it. > This scheme seems unsound It may not be communicated well in this thread. We had much discussion at Cauldron the last few days. The patch may be the best way to communicate it at this point, though I'm sure I missed some details. > pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle. Hm??? > > > That is, we should have an interface like: > > > > > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie); > > Just make it a void* and let the consumer figure it out. I think it makes sense to manage the cookie in the deferred unwinder since the need to correlate the stack traces is an inherent part of the deferral. -- Josh
On Mon, 16 Sep 2024 17:39:53 +0200 Josh Poimboeuf <jpoimboe@kernel.org> wrote: > Completely untested patch here that I hacked up today: I can tell it wasn't tested ;-) > > git://git.kernel.org/pub/scm/linux/kernel/git/jpoimboe/linux.git sframe-2.1 > > To avoid races, unwind_user_deferred() can't be called from NMI. The > tracers will need to trigger irq_work to call it. diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c index 7edb0833fe46..c56cf5d564df 100644 --- a/kernel/unwind/user.c +++ b/kernel/unwind/user.c @@ -177,7 +181,8 @@ int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie) cookie = __this_cpu_read(ctx_ctr); cookie &= ((1UL << 48) - 1); - cookie |= ((cpu << 48) + 1); + cookie |= cpu << 48; + cookie++; __this_cpu_write(ctx_ctr, cookie); current->unwind_ctx_cookie = cookie; As the cookie never got incremented. That was just one issue. Things are still not working but I'll debug the rest later. -- Steve
On Wed, Oct 02, 2024 at 10:31:25PM -0400, Steven Rostedt wrote: > +++ b/kernel/unwind/user.c > @@ -177,7 +181,8 @@ int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie) > > cookie = __this_cpu_read(ctx_ctr); > cookie &= ((1UL << 48) - 1); > - cookie |= ((cpu << 48) + 1); > + cookie |= cpu << 48; > + cookie++; > __this_cpu_write(ctx_ctr, cookie); > > current->unwind_ctx_cookie = cookie; > > As the cookie never got incremented. Ha, that might help ;-) > That was just one issue. Things are still not working but I'll debug the > rest later. If you don't want to run on the bleeding edge, I'll be testing it soon myself once I get the coding done for my v3. -- Josh
On Wed, 2 Oct 2024 19:37:02 -0700 Josh Poimboeuf <jpoimboe@kernel.org> wrote: > > That was just one issue. Things are still not working but I'll debug the > > rest later. > > If you don't want to run on the bleeding edge, I'll be testing it soon > myself once I get the coding done for my v3. Yeah, I'll wait for that. I have some code working for ftrace on top of this, but I also promised Mathieu I would start looking at his faultable trace points, so I should do that before working more on sframe. -- Steve
On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote: > The cookie is incremented once per entry from userspace, when needed. > > It's a unique id used by the tracers which is emitted with both kernel > and user stacktrace samples so the userspace tool can stitch them back > together. Even if you have multiple tracers triggering at the same time > they can all share a single user trace. But perf don't need this at all, right? It knows that the next deferred trace for that tid will be the one. > > This scheme seems unsound > > pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle. > > Hm??? The point being that it is possible to wrap one CPU into the id space of another CPU. It is not trivial, but someone who wants to can make it happen. Combined I don't see the need to force this into this scheme, carry an opaque (void*) pointer and let the user do whatever it wants, perf for instance can pass NULL and not do anything like this.
On 2024-09-16 20:15, Peter Zijlstra wrote: [...] > The point being that it is possible to wrap one CPU into the id space of > another CPU. It is not trivial, but someone who wants to can make it > happen. I agree that the overflow of the free-running counter bleeding into the CPU numbers is something we want to prevent. We don't care if this counter overflows after day/months/years for sake of correlation within a system call, but we do care about the fact that this free-running counter could be made to overflow after a very long time while the system runs, and then we reach a state where the CPU numbers are mixed up, which leads to short-term correlation issues. I would recommend this layout for this 64-bit value instead: low-bits: cpu number high-bits: free-running counter This way, we eliminate any carry from overflow into the cpu number bits. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2024-09-16 02:15, Mathieu Desnoyers wrote: > On 2024-09-16 20:15, Peter Zijlstra wrote: > [...] >> The point being that it is possible to wrap one CPU into the id space of >> another CPU. It is not trivial, but someone who wants to can make it >> happen. > > I agree that the overflow of the free-running counter bleeding into the > CPU numbers is something we want to prevent. We don't care if this > counter overflows after day/months/years for sake of correlation > within a system call, but we do care about the fact that this > free-running counter could be made to overflow after a very long > time while the system runs, and then we reach a state where the > CPU numbers are mixed up, which leads to short-term correlation > issues. > > I would recommend this layout for this 64-bit value instead: > > low-bits: cpu number > high-bits: free-running counter > > This way, we eliminate any carry from overflow into the cpu number bits. Even better: AFAIR from the discussion I had with Steven and Josh, we intend to have the cookie stored to/read from the task struct with interrupts off, we can simply do: struct stackwalk_cookie { uint64_t counter; /* free running per-cpu counter value */ int cpu; /* cpu on which the counter was snapshot. */ }; So we don't have to deal with any overflow trickiness, there is no need for bit-shifting, and we get a full 64-bit free-running counter independently of the number of CPUs. Thanks, Mathieu > > Thanks, > > Mathieu > > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On 2024-09-16 02:33, Mathieu Desnoyers wrote: > On 2024-09-16 02:15, Mathieu Desnoyers wrote: >> On 2024-09-16 20:15, Peter Zijlstra wrote: >> [...] >>> The point being that it is possible to wrap one CPU into the id space of >>> another CPU. It is not trivial, but someone who wants to can make it >>> happen. >> >> I agree that the overflow of the free-running counter bleeding into the >> CPU numbers is something we want to prevent. We don't care if this >> counter overflows after day/months/years for sake of correlation >> within a system call, but we do care about the fact that this >> free-running counter could be made to overflow after a very long >> time while the system runs, and then we reach a state where the >> CPU numbers are mixed up, which leads to short-term correlation >> issues. >> >> I would recommend this layout for this 64-bit value instead: >> >> low-bits: cpu number >> high-bits: free-running counter >> >> This way, we eliminate any carry from overflow into the cpu number bits. > > Even better: AFAIR from the discussion I had with Steven and Josh, we > intend > to have the cookie stored to/read from the task struct with interrupts off, > we can simply do: > > struct stackwalk_cookie { > uint64_t counter; /* free running per-cpu counter value */ > int cpu; /* cpu on which the counter was snapshot. */ > }; > > So we don't have to deal with any overflow trickiness, there is no need for > bit-shifting, and we get a full 64-bit free-running counter > independently of > the number of CPUs. Sorry my laptop had the wrong date when I sent this email. Re-sending to make sure it gets seen in the correct order time-wise. Thanks, Mathieu > > Thanks, > > Mathieu > > > >> >> Thanks, >> >> Mathieu >> >> > -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On Mon, 16 Sep 2024 20:15:45 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote: > > > The cookie is incremented once per entry from userspace, when needed. > > > > It's a unique id used by the tracers which is emitted with both kernel > > and user stacktrace samples so the userspace tool can stitch them back > > together. Even if you have multiple tracers triggering at the same time > > they can all share a single user trace. > > But perf don't need this at all, right? It knows that the next deferred > trace for that tid will be the one. Is that because perf uses per task buffers? Will perf know this if it uses a global buffer? What does perf do with "-a"? > > > > This scheme seems unsound > > > > pin yourself on CPU0 and trigger 1<<48 unwinds while keeping CPU1 idle. > > > > Hm??? > > The point being that it is possible to wrap one CPU into the id space of > another CPU. It is not trivial, but someone who wants to can make it > happen. > > Combined I don't see the need to force this into this scheme, carry an > opaque (void*) pointer and let the user do whatever it wants, perf for > instance can pass NULL and not do anything like this. That would put a huge burden on those tracers. Specifically ftrace. Just because perf doesn't need it, it doesn't mean it can't be part of the infrastructure. The cookie isn't even per-cpu. The counter is just generated per cpu to avoid races. The cookie will have the CPU as part of the bits, plus the per-cpu counter. This will give us a unique value across all CPUs without having to do any atomic operations. It's just that the cookie must be unique for every time a task enters the kernel, and stay the same while it is in the kernel. So if the task enters the kernel on CPU 1 and a trace is requested, the tracer still needs to save that event when the request occurred (for example, the kernel stack trace). It needs a unique identifier that will be attached to the user stack trace when it is made, so that post processing can tell where to append that stack. Then the task can migrate to several other CPUs, but it will still have the same cookie as when it received it on the first CPU. The cookie only gets updated on the task when it goes back to user space. ftrace allows for several instances that could be filtering on the same or different tasks. One instance could filter 3 tasks, another 2 tasks, where one of those tasks is in both instance. The user stack infrastructure should give a way to easily map the request to the stack via some kind of identifier. What would perf do on dropped events? Say a system call happened, perf requested a user stack trace, then the events were dropped, and when it started again, it was in a system call, how would perf know it is in the same system call? I guess the callback when the dropped user stack trace occurred would tell perf to reset. Still it adds a pretty big burden for the tracers to handle these corner cases whereas a simple identifier is much more simple and robust. -- Steve
Hello, On Mon, Sep 16, 2024 at 06:46:45PM -0400, Steven Rostedt wrote: > On Mon, 16 Sep 2024 20:15:45 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote: > > > > > The cookie is incremented once per entry from userspace, when needed. > > > > > > It's a unique id used by the tracers which is emitted with both kernel > > > and user stacktrace samples so the userspace tool can stitch them back > > > together. Even if you have multiple tracers triggering at the same time > > > they can all share a single user trace. > > > > But perf don't need this at all, right? It knows that the next deferred > > trace for that tid will be the one. Well.. technically you can sample without tid. But I'm not sure how much it'd be useful if you collect callchains without tid. > > Is that because perf uses per task buffers? Will perf know this if it > uses a global buffer? What does perf do with "-a"? Then it'd use per-cpu ring buffers. But each sample would have pid/tid pair and time so perf tools can match it with a deferred callchian. Thanks, Namhyung
On 2024-09-17 23:58, Namhyung Kim wrote: > Hello, > > On Mon, Sep 16, 2024 at 06:46:45PM -0400, Steven Rostedt wrote: >> On Mon, 16 Sep 2024 20:15:45 +0200 >> Peter Zijlstra <peterz@infradead.org> wrote: >> >>> On Mon, Sep 16, 2024 at 05:39:53PM +0200, Josh Poimboeuf wrote: >>> >>>> The cookie is incremented once per entry from userspace, when needed. >>>> >>>> It's a unique id used by the tracers which is emitted with both kernel >>>> and user stacktrace samples so the userspace tool can stitch them back >>>> together. Even if you have multiple tracers triggering at the same time >>>> they can all share a single user trace. >>> >>> But perf don't need this at all, right? It knows that the next deferred >>> trace for that tid will be the one. > > Well.. technically you can sample without tid. But I'm not sure how > much it'd be useful if you collect callchains without tid. > >> >> Is that because perf uses per task buffers? Will perf know this if it >> uses a global buffer? What does perf do with "-a"? > > Then it'd use per-cpu ring buffers. But each sample would have pid/tid > pair and time so perf tools can match it with a deferred callchian. That semantic correlation based on trace information should work fine if you do not miss important events in the trace. What the unique id cookie provides is robustness against confusion that can arise when events are discarded. Discarded events happen when the event throughput is too high for the ring buffer to handle. The following type of confusion can then arise: - If you miss a stack trace event and then a stack-sample-request event, then the post-processing tools can incorrectly infer causality between a prior stack-sample-request and a following stack trace for the same tid. - If you miss even more information about the end/beginning of lifetime of two threads, post-processing can be confused and associate a prior stack-sample-request from tid=N with a later stack trace for tid=N where N was re-used due to an exit/clone sequence. Saving the unique id cookie along with the stack-sample-request and with the stack trace allows more robust causality relationship between the two. Showing reliable causality information may not be super important for profiling use-cases, but it is important for event tracers. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
On Sun, 15 Sep 2024 13:11:11 +0200 Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Sat, Sep 14, 2024 at 08:12:46AM -0400, Steven Rostedt wrote: > > I think the unwinder should have an interface itself that provides the > > deferred unwinding, instead of having all the tracers to implement > > their own. > > > > The user space unwinder (regardless of sframe or not) should provide a > > way to say "I want a user space stack trace for here, but don't do it > > yet. Just give me a cookie and then call my callback function with the > > stack and cookie before going back to user space". > > We (Steven, Mathieu and I) have been discussing this at GNU Cauldron and > I think we're in basic agreement on this. > > I think the biggest tweak we decided on is that the context id (aka > "cookie") would be percpu. Its initial value is (cpuid << 48). It gets > incremented for every entry from user space. Well, in concept it is incremented for every entry from user space, but in reality it should only be incremented the first time it is referenced when coming into user space. That is, if there's nothing asking for a stack trace, then nothing should be incremented. > > > That is, we should have an interface like: > > > > typedef void (unwinder_callback_t)(struct user_space_stack *, u64 cookie); > > > struct unwinder_client { > > struct list_head list; > > unwinder_callback_t callback; > > I assume we want to allow tracers to pick sframes or FP (or auto). If > so we would need to add a user_unwind_type enum to this struct. Actually, I don't think they should care. A stack trace should just be passed to the callback. How that stack trace is created, is up to the unwinder. sframe should be used if it's avaliable, otherwise it will use the frame pointer. If neither is available, the callback could be called with NULL for the stack trace and then it can figure out what to do via pt_regs. > > Then the question is, what to do if tracer A wants sframe and tracer B > wants FP? I'm thinking it's fine to allow that. I assume the "multiple > tracers unwinding user space" case isn't realistic so any extra overhead > from these cases is the user's fault? > > The unwinder would need two sets of callbacks and task work functions: > one for sframe, one for FP. > > The tracer would need to pass its &my_unwinder struct to > unwinder_trigger() so the unwinder knows which task work function to > activate. > > > I thinking we also need a 'max_entries' field. No need to keep > unwinding if we've already reached the tracer's max. If there are > multiple callbacks then we'd have to get the max of the maxes, but > that's easy enough. We could set a max when the tracer registers a callback. It can be -1 to mean "no limit". I was talking with Mathieu at lunch, and we talked about having the stack information pointer be part of the task_struct. That is, the first time the unwinder is triggered and its doing the unwind (either sfarme or fp), it will allocate memory to store the stack trace information. This information is what is sent to the tracer's callbacks. Again, the tracers do not care how the stack frame was created. This memory is then saved in the task struct (freed when the task exits), and reused for future stack traces. If a stack trace could not be created, the callback just gets NULL and that will tell the tracer that nothing is available for the user space stack trace, and it can try to do something else. But that's not some the unwinder should care about. > > > Mathieu had requested passing an opaque void * from the NMI to the > callback, but I don't think that's possible with this scheme since the > unwinder wouldn't know which callback to give it to. Instead the tracer > would have to keep track of its own data associated with the given > cookie. > Not sure what that means, but from NMI, I would see the tracer doing: u64 cookie; cookie = undwinder_trigger(); And then the tracer is responsible to do something with that cookie. This would also trigger the "task_work()" where the ptrace path is entered when going back to user space, the unwinder will try to create a stack trace (from sframe or fp), saving it in this allocated memory that is then stored via a pointer on the task_struct so it only needs to be allocated once (or if a stack trace is bigger than what was allocated). Then all the tracer callbacks will be called with the stack trace and the cookie. The tracer will decided if it should use the stack trace from the cookie, or just ignore it. But the tracer shouldn't care about sframes or frame pointers. That's an implementation detail of the unwinder. -- Steve
Hello Josh! On 14.09.2024 01:02, Josh Poimboeuf wrote: > This is a new version of sframe user space unwinding + perf deferred > callchains. Better late than never. ... > These patches add support for unwinding user space from the kernel using > SFrame with perf. It should be easy to add user unwinding support for > other components like ftrace. ... We are looking forward to implement support for unwinding of user space using SFrame in kernel/perf on s390. One major concern is that your x86 implementation currently relies on a fallback to unwinding using frame pointer. On s390 unwinding using frame pointer is unsupported, because of lack of proper s390x ABI [1] specification and compiler support. In theory there would be a s390-specific alternative of unwinding using backchain (compiler option -mbackchain), but this has limitations and there is currently no distribution where user space is built with backchain. How much of an issue would it be if s390 could not provide an unwinding fallback implementation? Do you see the possibility to get away without? For s390 support of unwinding using SFrame we would need to make a few changes to your generic unwinding framework in the kernel: - Support for architectures that do not define CFA == SP at callsite: On s390 the CFA is defined as SP at callsite +160. The stack pointer (SP) therefore unwinds as SP = CFA - 160. For that we would introduce e.g. a sp_val_off field (SP value offset from CFA) in struct user_unwind_frame that would default to 0 on all architectures except s390. - Support for architectures where RA is not necessarily saved on stack: On s390 the return address (RA) is not saved (on stack) at function entry. In leaf functions it is not necessarily saved at all. - Support for architectures were RA/FP are saved in registers in leaf functions: On s390 the frame pointer (FP) and return address (RA) registers can be saved in other registers when in leaf functions. In the SFrame format we would encode the DWARF register number as offset using the least-significant bit as indication: offset = (regnum << 1) | 1. Therefore we would need to extend your generic unwinding framework to support FP and RA to be restored from registers. [1]: s390x ELF ABI supplement, https://github.com/IBM/s390x-abi/releases Thanks and regards, Jens -- Jens Remus Linux on Z Development (D3303) and z/VSE Support +49-7031-16-1128 Office jremus@de.ibm.com IBM IBM Deutschland Research & Development GmbH; Vorsitzender des Aufsichtsrats: Wolfgang Wendt; Geschäftsführung: David Faller; Sitz der Gesellschaft: Böblingen; Registergericht: Amtsgericht Stuttgart, HRB 243294 IBM Data Privacy Statement: https://www.ibm.com/privacy/
On Wed, Oct 23, 2024 at 03:22:35PM +0200, Jens Remus wrote: > We are looking forward to implement support for unwinding of user space > using SFrame in kernel/perf on s390. One major concern is that your x86 > implementation currently relies on a fallback to unwinding using frame > pointer. On s390 unwinding using frame pointer is unsupported, because > of lack of proper s390x ABI [1] specification and compiler support. In > theory there would be a s390-specific alternative of unwinding using > backchain (compiler option -mbackchain), but this has limitations and > there is currently no distribution where user space is built with > backchain. > > How much of an issue would it be if s390 could not provide an unwinding > fallback implementation? Do you see the possibility to get away without? No problem, I'll make the fallback dependent on CONFIG_HAVE_UNWIND_USER_FP. > For s390 support of unwinding using SFrame we would need to make a few > changes to your generic unwinding framework in the kernel: Also no problem, I'm sure there will need to be tweaks going forward. Thanks for looking at it! v3 will be posted soon. -- Josh
On Wed, 23 Oct 2024 15:22:35 +0200 Jens Remus <jremus@linux.ibm.com> wrote: > We are looking forward to implement support for unwinding of user space > using SFrame in kernel/perf on s390. One major concern is that your x86 > implementation currently relies on a fallback to unwinding using frame > pointer. On s390 unwinding using frame pointer is unsupported, because > of lack of proper s390x ABI [1] specification and compiler support. In > theory there would be a s390-specific alternative of unwinding using > backchain (compiler option -mbackchain), but this has limitations and > there is currently no distribution where user space is built with > backchain. > > How much of an issue would it be if s390 could not provide an unwinding > fallback implementation? Do you see the possibility to get away without? Yes. Even with x86, there's no guarantee that the applications will have frame pointers available. Basically it just returns a stack frame of one (the IP of where user space entered the kernel). > > For s390 support of unwinding using SFrame we would need to make a few > changes to your generic unwinding framework in the kernel: > > - Support for architectures that do not define CFA == SP at callsite: > On s390 the CFA is defined as SP at callsite +160. The stack pointer > (SP) therefore unwinds as SP = CFA - 160. For that we would introduce > e.g. a sp_val_off field (SP value offset from CFA) in struct > user_unwind_frame that would default to 0 on all architectures except > s390. > > - Support for architectures where RA is not necessarily saved on stack: > On s390 the return address (RA) is not saved (on stack) at function > entry. In leaf functions it is not necessarily saved at all. > > - Support for architectures were RA/FP are saved in registers in leaf > functions: > On s390 the frame pointer (FP) and return address (RA) registers can > be saved in other registers when in leaf functions. In the SFrame > format we would encode the DWARF register number as offset using the > least-significant bit as indication: offset = (regnum << 1) | 1. > Therefore we would need to extend your generic unwinding framework to > support FP and RA to be restored from registers. > > [1]: s390x ELF ABI supplement, > https://github.com/IBM/s390x-abi/releases Note that Indu (who's on the Cc and is also the author of sframes) gave a talk at GNU Cauldron about s390 support. I'm assuming that her new sframe specification will cover all of this. Then we will have to implement it. -- Steve
© 2016 - 2024 Red Hat, Inc.