When tracing user functions with uprobes, it's common to
install the probe (e.g., a BPF program) at the first instruction of the
function. This is often going to be the `push %rbp` instruction in the
function prologue, which means that within that function the frame
pointer hasn't been established yet. This leads to consistently missing
the actual caller of the traced function, because perf_callchain_user()
only records the current IP (capturing the traced function) and then
follows the frame pointer chain (which at that point is still the
caller's frame, containing the address of the caller's caller).
So when we have a target_1 -> target_2 -> target_3 call chain and we are
tracing the entry to target_3, the captured stack trace will report a
target_1 -> target_3 call chain, which is wrong and confusing.
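To illustrate, here is the typical shape of a frame-pointer prologue
(hypothetical disassembly; exact codegen varies by compiler and flags):

  <target_3>:
    push %rbp          # uprobe lands here; %rbp still points to
                       # target_2's frame and (%rsp) holds the return
                       # address into target_2
    mov  %rsp,%rbp     # target_3's frame is established only here
    ...

At the probed instruction, unwinding via %rbp starts from target_2's
frame, whose recorded return address points into target_1, so target_2
itself never makes it into the trace.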
This patch proposes an x86-64-specific heuristic to detect the
`push %rbp` (`push %ebp` on 32-bit) instruction being traced. Given
that the entire kernel implementation of user space stack trace
capturing works under the assumption that user space code was compiled
with frame pointer register (%rbp/%ebp) preservation, it seems
reasonable to use this instruction as a strong indicator that this is
the entry to the function. In that case, the return address is still
pointed to by %rsp/%esp, so we fetch it and add it to the stack trace
before proceeding to unwind the rest using the frame pointer-based
logic.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
v1->v2:
- use native unsigned long for ret_addr (Peter);
- add the same logic to the compat path in perf_callchain_user32() (Peter).
arch/x86/events/core.c | 33 +++++++++++++++++++++++++++++++++
include/linux/uprobes.h | 2 ++
kernel/events/uprobes.c | 2 ++
3 files changed, 37 insertions(+)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 5b0dd07b1ef1..60821c1ff2f3 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2833,6 +2833,19 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry_ctx *ent
fp = compat_ptr(ss_base + regs->bp);
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+ /* see perf_callchain_user() below for why we do this */
+ if (current->utask) {
+ struct arch_uprobe *auprobe = current->utask->auprobe;
+ u32 ret_addr;
+
+ if (auprobe && auprobe->insn[0] == 0x55 /* push %ebp */ &&
+ !__get_user(ret_addr, (const u32 __user *)regs->sp))
+ perf_callchain_store(entry, ret_addr);
+ }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
@@ -2884,6 +2897,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
return;
pagefault_disable();
+
+#ifdef CONFIG_UPROBES
+ /*
+ * If we are called from uprobe handler, and we are indeed at the very
+ * entry to user function (which is normally a `push %rbp` instruction,
+ * under assumption of application being compiled with frame pointers),
+ * we should read return address from *regs->sp before proceeding
+ * to follow frame pointers, otherwise we'll skip immediate caller
+ * as %rbp is not yet setup.
+ */
+ if (current->utask) {
+ struct arch_uprobe *auprobe = current->utask->auprobe;
+ unsigned long ret_addr;
+
+ if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp/%ebp */ &&
+ !__get_user(ret_addr, (const unsigned long __user *)regs->sp))
+ perf_callchain_store(entry, ret_addr);
+ }
+#endif
+
while (entry->nr < entry->max_stack) {
if (!valid_user_frame(fp, sizeof(frame)))
break;
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b503fafb7fb3..a270a5892ab4 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,8 @@ struct uprobe_task {
struct uprobe *active_uprobe;
unsigned long xol_vaddr;
+ struct arch_uprobe *auprobe;
+
struct return_instance *return_instances;
unsigned int depth;
};
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..6e22e4d80f1e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2082,6 +2082,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
bool need_prep = false; /* prepare return uprobe, when needed */
down_read(&uprobe->register_rwsem);
+ current->utask->auprobe = &uprobe->arch;
for (uc = uprobe->consumers; uc; uc = uc->next) {
int rc = 0;
@@ -2096,6 +2097,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
remove &= rc;
}
+ current->utask->auprobe = NULL;
if (need_prep && !remove)
prepare_uretprobe(uprobe, regs); /* put bp at return */
--
2.43.0
On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> When tracing user functions with uprobe functionality, it's common to
> install the probe (e.g., a BPF program) at the first instruction of the
> function. This is often going to be `push %rbp` instruction in function
> preamble, which means that within that function frame pointer hasn't
> been established yet. This leads to consistently missing an actual
> caller of the traced function, because perf_callchain_user() only
> records current IP (capturing traced function) and then following frame
> pointer chain (which would be caller's frame, containing the address of
> caller's caller).
>
> [...]
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Should it also check for ENDBR64?

When compiled with -fcf-protection=branch, the first instruction of the
function will almost always be ENDBR64. I'm not sure about other
distros, but at least Fedora compiles its binaries like that.

-- 
Josh
On Tue, Jul 02, 2024 at 04:35:56PM -0700, Josh Poimboeuf wrote:
> On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> > [...]
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>
> Should it also check for ENDBR64?
>
> When compiled with -fcf-protection=branch, the first instruction of the
> function will almost always be ENDBR64. I'm not sure about other
> distros, but at least Fedora compiles its binaries like that.

BTW, there are some cases (including leaf functions and some stack
alignment sequences) where a "push %rbp" can happen inside a function.
Then it would presumably add a bogus trace entry. Are such false
positives ok?

-- 
Josh
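For illustration, the kind of mid-function sequence Josh mentions could
look like this (hypothetical codegen, shown only to make the false
positive concrete):

  <some_func>:
    ...
    push %r12
    push %rbp          # spilling %rbp as an ordinary callee-saved
                       # register; a probe here would read (%rsp),
                       # which holds the saved %r12 value rather than
                       # a return address, producing one bogus entry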
On Tue, Jul 2, 2024 at 4:39 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 04:35:56PM -0700, Josh Poimboeuf wrote:
> > On Tue, Jul 02, 2024 at 10:18:58AM -0700, Andrii Nakryiko wrote:
> > > [...]
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >
> > Should it also check for ENDBR64?

Sure, I can add a check for endbr64 as well. endbr64 probably can be
used not just at function entry, is that right? So it might be another
case of false positive (which I think is ok, see below).

> > When compiled with -fcf-protection=branch, the first instruction of the
> > function will almost always be ENDBR64. I'm not sure about other
> > distros, but at least Fedora compiles its binaries like that.
>
> BTW, there are some cases (including leaf functions and some stack
> alignment sequences) where a "push %rbp" can happen inside a function.
> Then it would presumably add a bogus trace entry. Are such false
> positives ok?

I think such cases should be rare. People mostly seem to trace user
function entry/exit, rarely if ever they trace something within the
function, except for USDT cases, where it will be a nop instruction
that they trace.

In general, even with false positives, I think it's overwhelmingly
better to get correct entry stack trace 99.9% of the time, and in the
rest 0.01% cases it's fine having one extra bogus entry (but the rest
should still be correct), which should be easy for humans to recognize
and filter out, if necessary.

>
> --
> Josh
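For reference, a minimal sketch of how the entry-instruction check
could be extended with endbr64/endbr32 as discussed above (this is a
hypothetical helper, not the actual v3 patch; the opcode bytes are
f3 0f 1e fa for endbr64 and f3 0f 1e fb for endbr32):

  /*
   * Hypothetical helper, assuming auprobe->insn holds the original
   * instruction bytes of the probed instruction.
   */
  static bool is_uprobe_at_func_entry(const struct arch_uprobe *auprobe)
  {
  	/* push %rbp / push %ebp */
  	if (auprobe->insn[0] == 0x55)
  		return true;

  	/* endbr64 (f3 0f 1e fa) or endbr32 (f3 0f 1e fb) */
  	if (auprobe->insn[0] == 0xf3 && auprobe->insn[1] == 0x0f &&
  	    auprobe->insn[2] == 0x1e &&
  	    (auprobe->insn[3] == 0xfa || auprobe->insn[3] == 0xfb))
  		return true;

  	return false;
  }

Note that when the probed instruction is an endbr at function entry,
the following `push %rbp` has not executed yet either, so the return
address is still at (%rsp) and the same fetch applies.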
On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > Should it also check for ENDBR64?
>
> Sure, I can add a check for endbr64 as well. endbr64 probably can be
> used not just at function entry, is that right? So it might be another
> case of false positive (which I think is ok, see below).

Yeah, at least theoretically they could happen in the middle of a
function for implementing C switch jump tables.

> > > When compiled with -fcf-protection=branch, the first instruction of the
> > > function will almost always be ENDBR64. I'm not sure about other
> > > distros, but at least Fedora compiles its binaries like that.
> >
> > BTW, there are some cases (including leaf functions and some stack
> > alignment sequences) where a "push %rbp" can happen inside a function.
> > Then it would presumably add a bogus trace entry. Are such false
> > positives ok?
>
> I think such cases should be rare. People mostly seem to trace user
> function entry/exit, rarely if ever they trace something within the
> function, except for USDT cases, where it will be a nop instruction
> that they trace.
>
> In general, even with false positives, I think it's overwhelmingly
> better to get correct entry stack trace 99.9% of the time, and in the
> rest 0.01% cases it's fine having one extra bogus entry (but the rest
> should still be correct), which should be easy for humans to recognize
> and filter out, if necessary.

Agreed, this is a definite improvement overall.

BTW, soon there will be support for sframes instead of frame pointers,
at which point these checks should only be done for the frame pointer
case.

-- 
Josh
On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > [...]
> > In general, even with false positives, I think it's overwhelmingly
> > better to get correct entry stack trace 99.9% of the time, and in the
> > rest 0.01% cases it's fine having one extra bogus entry (but the rest
> > should still be correct), which should be easy for humans to recognize
> > and filter out, if necessary.
>
> Agreed, this is a definite improvement overall.

Cool, I'll incorporate that into v3 and send it soon.

> BTW, soon there will be support for sframes instead of frame pointers,
> at which point these checks should only be done for the frame pointer
> case.

Nice, this is one of the reasons I've been thinking about asynchronous
stack trace capture in BPF (see [0] from recent LSF/MM).

Few questions, while we are at it. Does it mean that
perf_callchain_user() will support working from sleepable context and
will wait for data to be paged in? Is anyone already working on this?
Any pointers?

[0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4

>
> --
> Josh
On Tue, Jul 02, 2024 at 08:35:08PM -0700, Andrii Nakryiko wrote:
> On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > On Tue, Jul 02, 2024 at 05:06:14PM -0700, Andrii Nakryiko wrote:
> > > [...]
> >
> > Agreed, this is a definite improvement overall.
>
> Cool, I'll incorporate that into v3 and send it soon.
>
> > BTW, soon there will be support for sframes instead of frame pointers,
> > at which point these checks should only be done for the frame pointer
> > case.
>
> Nice, this is one of the reasons I've been thinking about asynchronous
> stack trace capture in BPF (see [0] from recent LSF/MM).
>
> [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4

I don't seem to have permission to open it.

> Few questions, while we are at it. Does it mean that
> perf_callchain_user() will support working from sleepable context and
> will wait for data to be paged in? Is anyone already working on this?
> Any pointers?

I had a prototype here:

  https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org

Hopefully I can get started on v2 soon.

-- 
Josh
On Tue, Jul 2, 2024 at 11:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 08:35:08PM -0700, Andrii Nakryiko wrote:
> > On Tue, Jul 2, 2024 at 6:11 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
> > > [...]
> > > Agreed, this is a definite improvement overall.
> >
> > Cool, I'll incorporate that into v3 and send it soon.

BTW, if you have a chance, please do take a look at v3 and leave your
ack, if you are ok with it. Thanks!

> > > BTW, soon there will be support for sframes instead of frame pointers,
> > > at which point these checks should only be done for the frame pointer
> > > case.
> >
> > Nice, this is one of the reasons I've been thinking about asynchronous
> > stack trace capture in BPF (see [0] from recent LSF/MM).
> >
> > [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
>
> I don't seem to have permission to open it.

Argh, sorry, it's under my corporate account which doesn't allow
others to view it. Try this, I "published" it, let me know if that
still doesn't work:

[0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub

> > Few questions, while we are at it. Does it mean that
> > perf_callchain_user() will support working from sleepable context and
> > will wait for data to be paged in? Is anyone already working on this?
> > Any pointers?
>
> I had a prototype here:
>
>   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
>
> Hopefully I can get started on v2 soon.

Ok, so you are going to work on this. Please cc me on future revisions
then. Thanks!

>
> --
> Josh
On Wed, Jul 03, 2024 at 01:23:39PM -0700, Andrii Nakryiko wrote:
> > > [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
> >
> > I don't seem to have permission to open it.
>
> Argh, sorry, it's under my corporate account which doesn't allow
> others to view it. Try this, I "published" it, let me know if that
> still doesn't work:
>
> [0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub

The new link doesn't work either :-)

> > > Few questions, while we are at it. Does it mean that
> > > perf_callchain_user() will support working from sleepable context and
> > > will wait for data to be paged in? Is anyone already working on this?
> > > Any pointers?
> >
> > I had a prototype here:
> >
> >   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
> >
> > Hopefully I can get started on v2 soon.
>
> Ok, so you are going to work on this. Please cc me on future revisions
> then. Thanks!

Will do!

-- 
Josh
On Wed, Jul 3, 2024 at 3:41 PM Josh Poimboeuf <jpoimboe@kernel.org> wrote:
>
> On Wed, Jul 03, 2024 at 01:23:39PM -0700, Andrii Nakryiko wrote:
> > > > [0] https://docs.google.com/presentation/d/1k10-HtK7pP5CMMa86dDCdLW55fHOut4co3Zs5akk0t4
> > >
> > > I don't seem to have permission to open it.
> >
> > Argh, sorry, it's under my corporate account which doesn't allow
> > others to view it. Try this, I "published" it, let me know if that
> > still doesn't work:
> >
> > [0] https://docs.google.com/presentation/d/e/2PACX-1vRgL3UPbkrznwtNPKn-sSjvan7tFeMqOrIyZAFSSEPYiWG20JGSP80jBmZqGwqMuBGVmv9vyLU4KRTx/pub
>
> The new link doesn't work either :-)

Goodness, sorry about that. I just recreated it under my public account
and shared it with the world. This HAS to work:

https://docs.google.com/presentation/d/1eaOf9CVZlCOD6b7_UtZBYMfTyYIDZw9clyjzu-IIOIo

> > > > Few questions, while we are at it. Does it mean that
> > > > perf_callchain_user() will support working from sleepable context and
> > > > will wait for data to be paged in? Is anyone already working on this?
> > > > Any pointers?
> > >
> > > I had a prototype here:
> > >
> > >   https://lkml.kernel.org/lkml/cover.1699487758.git.jpoimboe@kernel.org
> > >
> > > Hopefully I can get started on v2 soon.
> >
> > Ok, so you are going to work on this. Please cc me on future revisions
> > then. Thanks!
>
> Will do!
>
> --
> Josh