When tracing user functions with uprobes, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be the `push %rbp` instruction in the
function prologue, which means that within that function the frame
pointer hasn't been established yet. This leads to consistently missing
the actual caller of the traced function, because perf_callchain_user()
only records the current IP (capturing the traced function) and then
follows the frame pointer chain (which at that point is still the
caller's frame, containing the address of the caller's caller).
So when we have a target_1 -> target_2 -> target_3 call chain and we are
tracing the entry to target_3, the captured stack trace will report a
target_1 -> target_3 call chain, which is wrong and confusing.
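To illustrate, here is the typical shape of a frame-pointer prologue
(hypothetical disassembly; exact codegen varies by compiler and flags):

  <target_3>:
    push %rbp          # uprobe lands here; %rbp still points to
                       # target_2's frame and (%rsp) holds the return
                       # address into target_2
    mov  %rsp,%rbp     # target_3's frame is established only here
    ...

At the probed instruction, unwinding via %rbp starts from target_2's
frame, whose recorded return address points into target_1, so target_2
itself never makes it into the trace.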
This patch proposes an x86-64-specific heuristic to detect the
`push %rbp` (`push %ebp` on 32-bit) instruction being traced. Given
that the entire kernel implementation of user space stack trace
capturing works under the assumption that user space code was compiled
with frame pointer register (%rbp/%ebp) preservation, it seems
reasonable to use this instruction as a strong indicator that this is
the entry to the function. In that case, the return address is still
pointed to by %rsp/%esp, so we fetch it and add it to the stack trace
before proceeding to unwind the rest using the frame pointer-based
logic.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
v1->v2:
- use native unsigned long for ret_addr (Peter);
- add the same logic to the compat path in perf_callchain_user32() (Peter).
arch/x86/events/core.c | 33 +++++++++++++++++++++++++++++++++
include/linux/uprobes.h | 2 ++
kernel/events/uprobes.c | 2 ++
3 files changed, 37 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..60821c1ff2f3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2833,6 +2833,19 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
fp = compat_ptr(ss_base + regs->bp);
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+ /* see perf_callchain_user() below for why we do this */
+ if (current->utask) {
+ struct arch_uprobe *auprobe = current->utask->auprobe;
+ u32 ret_addr;
+
+ if (auprobe && auprobe->insn[0] == 0x55 /* push %ebp */ &&
+ !__get_user(ret_addr, (const u32 __user *)regs->sp))
+ perf_callchain_store(entry, ret_addr);
+ }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
@@ -2884,6 +2897,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
return;
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+ /*
+ * If we are called from uprobe handler, and we are indeed at the very
+ * entry to user function (which is normally a `push %rbp` instruction,
+ * under assumption of application being compiled with frame pointers),
+ * we should read return address from *regs->sp before proceeding
+ * to follow frame pointers, otherwise we'll skip immediate caller
+ * as %rbp is not yet setup.
+ */
+ if (current->utask) {
+ struct arch_uprobe *auprobe = current->utask->auprobe;
+ unsigned long ret_addr;
+
+ if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp/%ebp */ &&
+ !__get_user(ret_addr, (const unsigned long __user *)regs->sp))
+ perf_callchain_store(entry, ret_addr);
+ }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a270a5892ab4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
struct uprobe *active_uprobe;
unsigned long xol_vaddr;
+ struct arch_uprobe *auprobe;
+
struct return_instance *return_instances;
unsigned int depth;
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..6e22e4d80f1e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2082,6 +2082,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
down_read(&uprobe->register_rwsem);
+ current->utask->auprobe = &uprobe->arch;
for (uc = uprobe->consumers; uc; uc = uc->next) {
int rc = 0;
@@ -2096,6 +2097,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
remove &= rc;
}
+ current->utask->auprobe = NULL;
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
--
2.43.0
On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> When tracing user functions with uprobe functionality, it's common to
> install the probe (e.g., a BPF program) at the first instruction of the
> function. This is often going to be `push %rbp` instruction in function
> preamble, which means that within that function frame pointer hasn't
> been established yet. This leads to consistently missing an actual
> caller of the traced function, because perf_callchain_user() only
> records current IP (capturing traced function) and then following frame
> pointer chain (which would be caller's frame, containing the address of
> caller's caller).
>
> [...]
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Should it also check for ENDBR64?

When compiled with -fcf-protection=branch, the first instruction of the
function will almost always be ENDBR64. I'm not sure about other
distros, but at least Fedora compiles its binaries like that.

-- 
Josh
On Tue, Jul 02, 2024 at 04:35:56PM -0700, Josh Poimboeuf wrote:
> On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> > [...]
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>
> Should it also check for ENDBR64?
>
> When compiled with -fcf-protection=branch, the first instruction of the
> function will almost always be ENDBR64. I'm not sure about other
> distros, but at least Fedora compiles its binaries like that.

BTW, there are some cases (including leaf functions and some stack
alignment sequences) where a "push %rbp" can happen inside a function.
Then it would presumably add a bogus trace entry. Are such false
positives ok?

-- 
Josh
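For illustration, the kind of mid-function sequence Josh mentions could
look like this (hypothetical codegen, shown only to make the false
positive concrete):

  <some_func>:
    ...
    push %r12
    push %rbp          # spilling %rbp as an ordinary callee-saved
                       # register; a probe here would read (%rsp),
                       # which holds the saved %r12 value rather than
                       # a return address, producing one bogus entry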
On Tue, Jul 2, 2024 at 4:39 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 04:35:56PM -0700, Josh Poimboeuf wrote:
> > On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> > > [...]
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > Should it also check for ENDBR64?

Sure, I can add a check for endbr64 as well. endbr64 probably can be
used not just at function entry, is that right? So it might be another
case of false positive (which I think is ok, see below).

> > When compiled with -fcf-protection=branch, the first instruction of the
> > function will almost always be ENDBR64. I'm not sure about other
> > distros, but at least Fedora compiles its binaries like that.
>
> BTW, there are some cases (including leaf functions and some stack
> alignment sequences) where a "push %rbp" can happen inside a function.
> Then it would presumably add a bogus trace entry. Are such false
> positives ok?

I think such cases should be rare. People mostly seem to trace user
function entry/exit, rarely if ever they trace something within the
function, except for USDT cases, where it will be a nop instruction
that they trace.

In general, even with false positives, I think it's overwhelmingly
better to get correct entry stack trace 99.9% of the time, and in the
rest 0.01% cases it's fine having one extra bogus entry (but the rest
should still be correct), which should be easy for humans to recognize
and filter out, if necessary.

>
> --
> Josh
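For reference, a minimal sketch of how the entry-instruction check
could be extended with endbr64/endbr32 as discussed above (this is a
hypothetical helper, not the actual v3 patch; the opcode bytes are
f3 0f 1e fa for endbr64 and f3 0f 1e fb for endbr32):

  /*
   * Hypothetical helper, assuming auprobe->insn holds the original
   * instruction bytes of the probed instruction.
   */
  static bool is_uprobe_at_func_entry(const struct arch_uprobe *auprobe)
  {
  	/* push %rbp / push %ebp */
  	if (auprobe->insn[0] == 0x55)
  		return true;

  	/* endbr64 (f3 0f 1e fa) or endbr32 (f3 0f 1e fb) */
  	if (auprobe->insn[0] == 0xf3 && auprobe->insn[1] == 0x0f &&
  	    auprobe->insn[2] == 0x1e &&
  	    (auprobe->insn[3] == 0xfa || auprobe->insn[3] == 0xfb))
  		return true;

  	return false;
  }

Note that when the probed instruction is an endbr at function entry,
the following `push %rbp` has not executed yet either, so the return
address is still at (%rsp) and the same fetch applies.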
On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > Should it also check for ENDBR64?
>
> Sure, I can add a check for endbr64 as well. endbr64 probably can be
> used not just at function entry, is that right? So it might be another
> case of false positive (which I think is ok, see below).

Yeah, at least theoretically they could happen in the middle of a
function for implementing C switch jump tables.

> > > When compiled with -fcf-protection=branch, the first instruction of the
> > > function will almost always be ENDBR64. I'm not sure about other
> > > distros, but at least Fedora compiles its binaries like that.
> >
> > BTW, there are some cases (including leaf functions and some stack
> > alignment sequences) where a "push %rbp" can happen inside a function.
> > Then it would presumably add a bogus trace entry. Are such false
> > positives ok?
>
> I think such cases should be rare. People mostly seem to trace user
> function entry/exit, rarely if ever they trace something within the
> function, except for USDT cases, where it will be a nop instruction
> that they trace.
>
> In general, even with false positives, I think it's overwhelmingly
> better to get correct entry stack trace 99.9% of the time, and in the
> rest 0.01% cases it's fine having one extra bogus entry (but the rest
> should still be correct), which should be easy for humans to recognize
> and filter out, if necessary.

Agreed, this is a definite improvement overall.

BTW, soon there will be support for sframes instead of frame pointers,
at which point these checks should only be done for the frame pointer
case.

-- 
Josh
On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > [...]
> > In general, even with false positives, I think it's overwhelmingly
> > better to get correct entry stack trace 99.9% of the time, and in the
> > rest 0.01% cases it's fine having one extra bogus entry (but the rest
> > should still be correct), which should be easy for humans to recognize
> > and filter out, if necessary.
>
> Agreed, this is a definite improvement overall.

Cool, I'll incorporate that into v3 and send it soon.

> BTW, soon there will be support for sframes instead of frame pointers,
> at which point these checks should only be done for the frame pointer
> case.

Nice, this is one of the reasons I've been thinking about asynchronous
stack trace capture in BPF (see [0] from recent LSF/MM).

Few questions, while we are at it. Does it mean that
perf_callchain_user() will support working from sleepable context and
will wait for data to be paged in? Is anyone already working on this?
Any pointers?

[0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4

>
> --
> Josh
On Tue, Jul 02, 2024 at 08:35:08PM -0700, Andrii Nakryiko wrote:
> On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > [...]
> >
> > Agreed, this is a definite improvement overall.
>
> Cool, I'll incorporate that into v3 and send it soon.
>
> > BTW, soon there will be support for sframes instead of frame pointers,
> > at which point these checks should only be done for the frame pointer
> > case.
>
> Nice, this is one of the reasons I've been thinking about asynchronous
> stack trace capture in BPF (see [0] from recent LSF/MM).
>
> [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4

I don't seem to have permission to open it.

> Few questions, while we are at it. Does it mean that
> perf_callchain_user() will support working from sleepable context and
> will wait for data to be paged in? Is anyone already working on this?
> Any pointers?

I had a prototype here:

  https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org

Hopefully I can get started on v2 soon.

-- 
Josh
On Tue, Jul 2, 2024 at 11:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 08:35:08PM -0700, Andrii Nakryiko wrote:
> > On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > [...]
> > > Agreed, this is a definite improvement overall.
> >
> > Cool, I'll incorporate that into v3 and send it soon.

BTW, if you have a chance, please do take a look at v3 and leave your
ack, if you are ok with it. Thanks!

> > > BTW, soon there will be support for sframes instead of frame pointers,
> > > at which point these checks should only be done for the frame pointer
> > > case.
> >
> > Nice, this is one of the reasons I've been thinking about asynchronous
> > stack trace capture in BPF (see [0] from recent LSF/MM).
> >
> > [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
>
> I don't seem to have permission to open it.

Argh, sorry, it's under my corporate account which doesn't allow
others to view it. Try this, I "published" it, let me know if that
still doesn't work:

[0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub

> > Few questions, while we are at it. Does it mean that
> > perf_callchain_user() will support working from sleepable context and
> > will wait for data to be paged in? Is anyone already working on this?
> > Any pointers?
>
> I had a prototype here:
>
>   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
>
> Hopefully I can get started on v2 soon.

Ok, so you are going to work on this. Please cc me on future revisions
then. Thanks!

>
> --
> Josh
On Wed, Jul 03, 2024 at 01:23:39PM -0700, Andrii Nakryiko wrote:
> > > [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
> >
> > I don't seem to have permission to open it.
>
> Argh, sorry, it's under my corporate account which doesn't allow
> others to view it. Try this, I "published" it, let me know if that
> still doesn't work:
>
> [0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub

The new link doesn't work either :-)

> > > Few questions, while we are at it. Does it mean that
> > > perf_callchain_user() will support working from sleepable context and
> > > will wait for data to be paged in? Is anyone already working on this?
> > > Any pointers?
> >
> > I had a prototype here:
> >
> >   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
> >
> > Hopefully I can get started on v2 soon.
>
> Ok, so you are going to work on this. Please cc me on future revisions
> then. Thanks!

Will do!

-- 
Josh
On Wed, Jul 3, 2024 at 3:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Wed, Jul 03, 2024 at 01:23:39PM -0700, Andrii Nakryiko wrote:
> > > > [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
> > >
> > > I don't seem to have permission to open it.
> >
> > Argh, sorry, it's under my corporate account which doesn't allow
> > others to view it. Try this, I "published" it, let me know if that
> > still doesn't work:
> >
> > [0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub
>
> The new link doesn't work either :-)

Goodness, sorry about that. I just recreated it under my public account
and shared it with the world. This HAS to work:

https://docs.google.com/presentation/d/1eaOf9CVZlCOD6b7_UtZBYMfTyYIDZw9clyjzu-IIOIo

> > > > Few questions, while we are at it. Does it mean that
> > > > perf_callchain_user() will support working from sleepable context and
> > > > will wait for data to be paged in? Is anyone already working on this?
> > > > Any pointers?
> > >
> > > I had a prototype here:
> > >
> > >   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
> > >
> > > Hopefully I can get started on v2 soon.
> >
> > Ok, so you are going to work on this. Please cc me on future revisions
> > then. Thanks!
>
> Will do!
>
> --
> Josh