[PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

Jiri Olsa posted 3 patches 1 year, 8 months ago
There is a newer version of this series
[PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Jiri Olsa 1 year, 8 months ago
Adding uretprobe syscall instead of trap to speed up return probe.

At the moment the uretprobe setup/path is:

  - install entry uprobe

  - when the uprobe is hit, it overwrites probed function's return address
    on stack with address of the trampoline that contains breakpoint
    instruction

  - the breakpoint trap code handles the uretprobe consumers execution and
    jumps back to original return address

This patch replaces the above trampoline's breakpoint instruction with a new
uretprobe syscall. The syscall does exactly the same job as the trap, with
some extra work:

  - the syscall trampoline must save the original values of the rax/r11/rcx
    registers on the stack - rax is set to the syscall number and r11/rcx are
    clobbered by the syscall instruction

  - the syscall code reads the original values of those registers and
    restores them in the task's pt_regs area

Even with the extra register handling code, having uretprobes handled by a
syscall shows a speed improvement.
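
For illustration, the return address hijack done by the entry uprobe (the
second step above) can be sketched roughly as below. This is conceptual
pseudo-code only, not the kernel implementation (the real work is done in
prepare_uretprobe() and arch_uretprobe_hijack_return_addr()), and the
remember_return_address() helper is made up:

	/* conceptual sketch of the entry uprobe side of a return probe */
	static void hijack_return_address(struct pt_regs *regs)
	{
		/* trampoline lives in slot 0 of the per-mm xol area */
		unsigned long trampoline = get_trampoline_vaddr();
		unsigned long orig_ret;

		/* read the return address the probed function would use */
		copy_from_user(&orig_ret, (void __user *)regs->sp, sizeof(orig_ret));

		/* record it so the trampoline handler can return to it later */
		remember_return_address(current, orig_ret);	/* made-up helper */

		/* make the probed function return into the trampoline instead */
		copy_to_user((void __user *)regs->sp, &trampoline, sizeof(trampoline));
	}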

  On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)

  current:

    base           :   15.888 ± 0.033M/s
    uprobe-nop     :    3.016 ± 0.000M/s
    uprobe-push    :    2.832 ± 0.005M/s
    uprobe-ret     :    1.104 ± 0.000M/s
    uretprobe-nop  :    1.487 ± 0.000M/s
    uretprobe-push :    1.456 ± 0.000M/s
    uretprobe-ret  :    0.816 ± 0.001M/s

  with the fix:

    base           :   15.116 ± 0.045M/s
    uprobe-nop     :    3.001 ± 0.045M/s
    uprobe-push    :    2.831 ± 0.004M/s
    uprobe-ret     :    1.102 ± 0.001M/s
    uretprobe-nop  :    1.969 ± 0.001M/s  < 32% speedup
    uretprobe-push :    1.905 ± 0.004M/s  < 30% speedup
    uretprobe-ret  :    0.933 ± 0.002M/s  < 14% speedup

  On Amd (AMD Ryzen 7 5700U)

  current:

    base           :    5.105 ± 0.003M/s
    uprobe-nop     :    1.552 ± 0.002M/s
    uprobe-push    :    1.408 ± 0.003M/s
    uprobe-ret     :    0.827 ± 0.001M/s
    uretprobe-nop  :    0.779 ± 0.001M/s
    uretprobe-push :    0.750 ± 0.001M/s
    uretprobe-ret  :    0.539 ± 0.001M/s

  with the fix:

    base           :    5.119 ± 0.002M/s
    uprobe-nop     :    1.523 ± 0.003M/s
    uprobe-push    :    1.384 ± 0.003M/s
    uprobe-ret     :    0.826 ± 0.002M/s
    uretprobe-nop  :    0.866 ± 0.002M/s  < 11% speedup
    uretprobe-push :    0.826 ± 0.002M/s  < 10% speedup
    uretprobe-ret  :    0.581 ± 0.001M/s  <  7% speedup

Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 arch/x86/kernel/uprobes.c              | 83 ++++++++++++++++++++++++++
 include/linux/syscalls.h               |  2 +
 include/linux/uprobes.h                |  2 +
 include/uapi/asm-generic/unistd.h      |  5 +-
 kernel/events/uprobes.c                | 18 ++++--
 kernel/sys_ni.c                        |  2 +
 7 files changed, 108 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..af0a33ab06ee 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	64	uretprobe		sys_uretprobe
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..6fc5d16f6c17 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,6 +12,7 @@
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
 #include <linux/uaccess.h>
+#include <linux/syscalls.h>
 
 #include <linux/kdebug.h>
 #include <asm/processor.h>
@@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
 }
 
 #ifdef CONFIG_X86_64
+
+asm (
+	".pushsection .rodata\n"
+	".global uretprobe_syscall_entry\n"
+	"uretprobe_syscall_entry:\n"
+	"pushq %rax\n"
+	"pushq %rcx\n"
+	"pushq %r11\n"
+	"movq $" __stringify(__NR_uretprobe) ", %rax\n"
+	"syscall\n"
+	"popq %r11\n"
+	"popq %rcx\n"
+
+	/* The uretprobe syscall replaces stored %rax value with final
+	 * return address, so we don't restore %rax in here and just
+	 * call ret.
+	 */
+	"retq\n"
+	".global uretprobe_syscall_end\n"
+	"uretprobe_syscall_end:\n"
+	".popsection\n"
+);
+
+extern u8 uretprobe_syscall_entry[];
+extern u8 uretprobe_syscall_end[];
+
+void *arch_uprobe_trampoline(unsigned long *psize)
+{
+	*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+	return uretprobe_syscall_entry;
+}
+
+SYSCALL_DEFINE0(uretprobe)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long err, ip, sp, r11_cx_ax[3];
+
+	err = copy_from_user(r11_cx_ax, (void __user *)regs->sp, sizeof(r11_cx_ax));
+	WARN_ON_ONCE(err);
+
+	/* expose the "right" values of r11/cx/ax/sp to uprobe_consumer/s */
+	regs->r11 = r11_cx_ax[0];
+	regs->cx  = r11_cx_ax[1];
+	regs->ax  = r11_cx_ax[2];
+	regs->sp += sizeof(r11_cx_ax);
+	regs->orig_ax = -1;
+
+	ip = regs->ip;
+	sp = regs->sp;
+
+	uprobe_handle_trampoline(regs);
+
+	/*
+	 * uprobe_consumer has changed sp, we can do nothing,
+	 * just return via iret
+	 */
+	if (regs->sp != sp)
+		return regs->ax;
+	regs->sp -= sizeof(r11_cx_ax);
+
+	/* for the case uprobe_consumer has changed r11/cx */
+	r11_cx_ax[0] = regs->r11;
+	r11_cx_ax[1] = regs->cx;
+
+	/*
+	 * ax register is passed through as return value, so we can use
+	 * its space on stack for ip value and jump to it through the
+	 * trampoline's ret instruction
+	 */
+	r11_cx_ax[2] = regs->ip;
+	regs->ip = ip;
+
+	err = copy_to_user((void __user *)regs->sp, r11_cx_ax, sizeof(r11_cx_ax));
+	WARN_ON_ONCE(err);
+
+	/* ensure sysret, see do_syscall_64() */
+	regs->r11 = regs->flags;
+	regs->cx  = regs->ip;
+
+	return regs->ax;
+}
+
 /*
  * If arch_uprobe->insn doesn't use rip-relative addressing, return
  * immediately.  Otherwise, rewrite the instruction so that it accesses
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 77eb9b0e7685..db150794f89d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, size_t *size, u32 flags);
 /* x86 */
 asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
 
+asmlinkage long sys_uretprobe(void);
+
 /* pciconfig: alpha, arm, arm64, ia64, sparc */
 asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
 				unsigned long off, unsigned long len,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..a490146ad89d 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
 					 void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+extern void *arch_uprobe_trampoline(unsigned long *psize);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..8a747cd1d735 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+#define __NR_uretprobe 462
+__SYSCALL(__NR_uretprobe, sys_uretprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..90395b16bde0 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -1474,11 +1474,20 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	return ret;
 }
 
+void * __weak arch_uprobe_trampoline(unsigned long *psize)
+{
+	static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+	*psize = UPROBE_SWBP_INSN_SIZE;
+	return &insn;
+}
+
 static struct xol_area *__create_xol_area(unsigned long vaddr)
 {
 	struct mm_struct *mm = current->mm;
-	uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+	unsigned long insns_size;
 	struct xol_area *area;
+	void *insns;
 
 	area = kmalloc(sizeof(*area), GFP_KERNEL);
 	if (unlikely(!area))
@@ -1502,7 +1511,8 @@ static struct xol_area *__create_xol_area(unsigned long vaddr)
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
 	atomic_set(&area->slot_count, 1);
-	arch_uprobe_copy_ixol(area->pages[0], 0, &insn, UPROBE_SWBP_INSN_SIZE);
+	insns = arch_uprobe_trampoline(&insns_size);
+	arch_uprobe_copy_ixol(area->pages[0], 0, insns, insns_size);
 
 	if (!xol_add_vma(mm, area))
 		return area;
@@ -2123,7 +2133,7 @@ static struct return_instance *find_next_ret_chain(struct return_instance *ri)
 	return ri;
 }
 
-static void handle_trampoline(struct pt_regs *regs)
+void uprobe_handle_trampoline(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 	struct return_instance *ri, *next;
@@ -2188,7 +2198,7 @@ static void handle_swbp(struct pt_regs *regs)
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 	if (bp_vaddr == get_trampoline_vaddr())
-		return handle_trampoline(regs);
+		return uprobe_handle_trampoline(regs);
 
 	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
 	if (!uprobe) {
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..be6195e0d078 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uretprobe);
-- 
2.44.0

Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Jiri Olsa 1 year, 8 months ago
On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote:

SNIP

>  #include <linux/kdebug.h>
>  #include <asm/processor.h>
> @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
>  }
>  
>  #ifdef CONFIG_X86_64
> +
> +asm (
> +	".pushsection .rodata\n"
> +	".global uretprobe_syscall_entry\n"
> +	"uretprobe_syscall_entry:\n"
> +	"pushq %rax\n"
> +	"pushq %rcx\n"
> +	"pushq %r11\n"
> +	"movq $" __stringify(__NR_uretprobe) ", %rax\n"
> +	"syscall\n"
> +	"popq %r11\n"
> +	"popq %rcx\n"
> +
> +	/* The uretprobe syscall replaces stored %rax value with final
> +	 * return address, so we don't restore %rax in here and just
> +	 * call ret.
> +	 */
> +	"retq\n"
> +	".global uretprobe_syscall_end\n"
> +	"uretprobe_syscall_end:\n"
> +	".popsection\n"
> +);
> +
> +extern u8 uretprobe_syscall_entry[];
> +extern u8 uretprobe_syscall_end[];
> +
> +void *arch_uprobe_trampoline(unsigned long *psize)
> +{
> +	*psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> +	return uretprobe_syscall_entry;

fyi I realized this screws 32-bit programs, we either need to add
compat trampoline, or keep the standard breakpoint for them:

+       struct pt_regs *regs = task_pt_regs(current);
+       static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+
+       if (user_64bit_mode(regs)) {
+               *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+               return uretprobe_syscall_entry;
+       }
+
+       *psize = UPROBE_SWBP_INSN_SIZE;
+       return &insn;


not sure it's worth the effort to add the trampoline, I'll check


jirka
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Andrii Nakryiko 1 year, 8 months ago
On Mon, Apr 15, 2024 at 1:25 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote:
>
> SNIP
>
> >  #include <linux/kdebug.h>
> >  #include <asm/processor.h>
> > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool
> >  }
> >
> >  #ifdef CONFIG_X86_64
> > +
> > +asm (
> > +     ".pushsection .rodata\n"
> > +     ".global uretprobe_syscall_entry\n"
> > +     "uretprobe_syscall_entry:\n"
> > +     "pushq %rax\n"
> > +     "pushq %rcx\n"
> > +     "pushq %r11\n"
> > +     "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > +     "syscall\n"
> > +     "popq %r11\n"
> > +     "popq %rcx\n"
> > +
> > +     /* The uretprobe syscall replaces stored %rax value with final
> > +      * return address, so we don't restore %rax in here and just
> > +      * call ret.
> > +      */
> > +     "retq\n"
> > +     ".global uretprobe_syscall_end\n"
> > +     "uretprobe_syscall_end:\n"
> > +     ".popsection\n"
> > +);
> > +
> > +extern u8 uretprobe_syscall_entry[];
> > +extern u8 uretprobe_syscall_end[];
> > +
> > +void *arch_uprobe_trampoline(unsigned long *psize)
> > +{
> > +     *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> > +     return uretprobe_syscall_entry;
>
> fyi I realized this screws 32-bit programs, we either need to add
> compat trampoline, or keep the standard breakpoint for them:
>
> +       struct pt_regs *regs = task_pt_regs(current);
> +       static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> +
> +       if (user_64bit_mode(regs)) {
> +               *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> +               return uretprobe_syscall_entry;
> +       }
> +
> +       *psize = UPROBE_SWBP_INSN_SIZE;
> +       return &insn;
>
>
> not sure it's worth the effort to add the trampoline, I'll check
>

32-bit arch isn't a high-performance target anyways, so I'd probably
not bother and prioritize simplicity and long term maintenance.

>
> jirka
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Masami Hiramatsu (Google) 1 year, 8 months ago
Hi Jiri,

On Tue,  2 Apr 2024 11:33:00 +0200
Jiri Olsa <jolsa@kernel.org> wrote:

> Adding uretprobe syscall instead of trap to speed up return probe.

This is interesting approach. But I doubt we need to add additional
syscall just for this purpose. Can't we use another syscall or ioctl?

Also, we should run syzkaller on this syscall. And if uretprobe is
set in the user function, what happen if the user function directly
calls this syscall? (maybe it consumes shadow stack?)

Thank you,



-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Jiri Olsa 1 year, 8 months ago
On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> Hi Jiri,
> 
> On Tue,  2 Apr 2024 11:33:00 +0200
> Jiri Olsa <jolsa@kernel.org> wrote:
> 
> > Adding uretprobe syscall instead of trap to speed up return probe.
> 
> This is interesting approach. But I doubt we need to add additional
> syscall just for this purpose. Can't we use another syscall or ioctl?

so the plan is to optimize entry uprobe in a similar way and given
the syscall is not a scarce resource I wanted to add another syscall
for that one as well

tbh I'm not sure which syscall or ioctl to reuse for this, it's
possible to do that, the trampoline will just have to save one or
more additional registers, but adding new syscall seems cleaner to me

> 
> Also, we should run syzkaller on this syscall. And if uretprobe is

right, I'll check on syzkaller

> set in the user function, what happen if the user function directly
> calls this syscall? (maybe it consumes shadow stack?)

the process should receive SIGILL if there's no pending uretprobe for
the current task, or it will trigger uretprobe if there's one pending

but we could limit the syscall to be executed just from the trampoline,
that should prevent all the user space use cases, I'll do that in next
version and add more tests for that
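
for illustration only, such a check could look roughly like the snippet
below - just a sketch, the uprobe_get_trampoline_vaddr() helper is an
assumption (the trampoline address would have to be exposed to the syscall
code somehow) and the exact form will be in the next version:

	/*
	 * sketch (not in this patch): refuse the syscall unless it was
	 * entered from the uretprobe trampoline
	 */
	struct pt_regs *regs = task_pt_regs(current);
	unsigned long tramp = uprobe_get_trampoline_vaddr();	/* assumed helper */

	/* regs->ip points right after the syscall instruction in the trampoline */
	if (regs->ip < tramp ||
	    regs->ip >= tramp + (uretprobe_syscall_end - uretprobe_syscall_entry)) {
		force_sig(SIGILL);
		return -1;
	}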

thanks,
jirka


Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Masami Hiramatsu (Google) 1 year, 8 months ago
On Wed, 3 Apr 2024 11:47:41 +0200
Jiri Olsa <olsajiri@gmail.com> wrote:

> On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > Hi Jiri,
> > 
> > On Tue,  2 Apr 2024 11:33:00 +0200
> > Jiri Olsa <jolsa@kernel.org> wrote:
> > 
> > > Adding uretprobe syscall instead of trap to speed up return probe.
> > 
> > This is interesting approach. But I doubt we need to add additional
> > syscall just for this purpose. Can't we use another syscall or ioctl?
> 
> so the plan is to optimize entry uprobe in a similar way and given
> the syscall is not a scarce resource I wanted to add another syscall
> for that one as well
> 
> tbh I'm not sure which syscall or ioctl to reuse for this, it's
> possible to do that, the trampoline will just have to save one or
> more additional registers, but adding new syscall seems cleaner to me

Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.

> 
> > 
> > Also, we should run syzkaller on this syscall. And if uretprobe is
> 
> right, I'll check on syzkaller
> 
> > set in the user function, what happen if the user function directly
> > calls this syscall? (maybe it consumes shadow stack?)
> 
> the process should receive SIGILL if there's no pending uretprobe for
> the current task, or it will trigger uretprobe if there's one pending

No, that is too aggressive and not safe. Since the syscall is exposed to
user program, it should return appropriate error code instead of SIGILL.

> 
> but we could limit the syscall to be executed just from the trampoline,
> that should prevent all the user space use cases, I'll do that in next
> version and add more tests for that

Why not limit? :) The uprobe_handle_trampoline() expects it is called
only from the trampoline, so it is natural to check the caller address.
(and uprobe should know where is the trampoline)

Since the syscall is always exposed to the user program, it should
- Do nothing and return an error unless it is properly called.
- check the prerequisites for operation strictly.
I concern that new system calls introduce vulnerabilities.

Thank you,




-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Andrii Nakryiko 1 year, 8 months ago
On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> Jiri Olsa <olsajiri@gmail.com> wrote:
>
> > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > Hi Jiri,
> > >
> > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > >
> > > This is interesting approach. But I doubt we need to add additional
> > > syscall just for this purpose. Can't we use another syscall or ioctl?
> >
> > so the plan is to optimize entry uprobe in a similar way and given
> > the syscall is not a scarce resource I wanted to add another syscall
> > for that one as well
> >
> > tbh I'm not sure which syscall or ioctl to reuse for this, it's
> > possible to do that, the trampoline will just have to save one or
> > more additional registers, but adding new syscall seems cleaner to me
>
> Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.

I think both ptrace and prctl are for completely different use cases
and it would be an abuse of existing API to reuse them for uretprobe
tracing. Also, keep in mind, that any extra argument that has to be
passed into this syscall means that we need to complicate and slow
generated assembly code that is injected into user process (to
save/restore registers) and also kernel-side (again, to deal with all
the extra registers that would be stored/restored on stack).

Given syscalls are not some kind of scarce resources, what's the
downside to have a dedicated and simple syscall?

>
> >
> > >
> > > Also, we should run syzkaller on this syscall. And if uretprobe is
> >
> > right, I'll check on syzkaller
> >
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.
>

This is the way it is today with uretprobes even through interrupt.
E.g., it could happen that user process is using fibers and is
replacing stack pointer without kernel realizing this, which will
trigger some defensive checks in uretprobe handling code and kernel
will send SIGILL because it can't support such cases. This is
happening today already, and it works fine in practice (except for
applications that manually change stack pointer, too bad, you can't
trace them with uretprobes, unfortunately).

So I think it's absolutely adequate to have this behavior if the user
process is *intentionally* abusing this API.
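
For concreteness, "intentionally abusing" the API means something as trivial
as the hypothetical snippet below (the syscall number is the one added by
this patch, x86-64 only; with no pending uretprobe the kernel ends up in
uprobe_handle_trampoline() with no return instance and sends SIGILL):

	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_uretprobe
	#define __NR_uretprobe 462	/* from this patch */
	#endif

	int main(void)
	{
		/* no uretprobe pending for this task -> SIGILL kills us */
		syscall(__NR_uretprobe);
		return 0;	/* not reached */
	}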

> >
> > but we could limit the syscall to be executed just from the trampoline,
> > that should prevent all the user space use cases, I'll do that in next
> > version and add more tests for that
>
> Why not limit? :) The uprobe_handle_trampoline() expects it is called
> only from the trampoline, so it is natural to check the caller address.
> (and uprobe should know where is the trampoline)
>
> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.
> I concern that new system calls introduce vulnerabilities.
>

As Oleg and Jiri mentioned, this syscall can't harm kernel or other
processes, only the process that is abusing the API. So any extra
checks that would slow down this approach is an unnecessary overhead
and complication that will never be useful in practice.

Also note that sys_uretprobe is a kind of internal and unstable API
and it is explicitly called out that its contract can change at any
time and user space shouldn't rely on it. It's purely for the kernel's
own usage.

So let's please keep it fast and simple.


> Thank you,
>
>
> >
> > thanks,
> > jirka
> >
> >
> > >

[...]
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Oleg Nesterov 1 year, 8 months ago
Again, I leave this to you and Jiri, but

On 04/03, Masami Hiramatsu wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.

...

> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.

We have sys_munmap(). should it check if the caller is going to unmap
the code region which contains regs->ip and do nothing?

I don't think it should. Userspace should blame itself, SIGSEGV is not
"too aggressive" in this case.

> I concern that new system calls introduce vulnerabilities.

Yes, we need to ensure that sys_uretprobe() can only damage the malicious
caller and nothing else.

Oleg.
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
Posted by Oleg Nesterov 1 year, 8 months ago
I leave this to you and Masami, but...

On 04/03, Jiri Olsa wrote:
>
> On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> >
> > This is interesting approach. But I doubt we need to add additional
> > syscall just for this purpose. Can't we use another syscall or ioctl?
>
> so the plan is to optimize entry uprobe in a similar way and given
> the syscall is not a scarce resource I wanted to add another syscall
> for that one as well
>
> tbh I'm not sure which syscall or ioctl to reuse for this, it's
> possible to do that, the trampoline will just have to save one or
> more additional registers, but adding new syscall seems cleaner to me

Agreed.

> > Also, we should run syzkaller on this syscall. And if uretprobe is
>
> right, I'll check on syzkaller

I don't understand this concern...

> > set in the user function, what happen if the user function directly
> > calls this syscall? (maybe it consumes shadow stack?)
>
> the process should receive SIGILL if there's no pending uretprobe for
> the current task,

Yes,

> or it will trigger uretprobe if there's one pending

... and corrupt the caller. So what?

> but we could limit the syscall to be executed just from the trampoline,
> that should prevent all the user space use cases, I'll do that in next
> version and add more tests for that

Yes, we can... well, ignoring the race with mremap() from another thread.

But why should we care?

Userspace should not call sys_uretprobe(). Likewise, it should not call
sys_restart_syscall(). Likewise, it should not jump to xol_area.

Of course, userspace (especially syzkaller) _can_ do this. So what?

I think the only thing we need to ensure is that the "malicious" task
which calls sys_uretprobe() can only harm itself, nothing more.

No?

Oleg.