[PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Liao Chang 2 months, 3 weeks ago
This patch is the second part of a series to improve the selftest bench
of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
significantly impacts uprobe/uretprobe performance at function entry in
most use cases. The profiling results below show that executing the STP
in the xol slot and trapping back into the kernel reduces Redis RPS and
noticeably increases the runtime of a string grep.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.

Redis GET (higher is better)
----------------------------
No uprobe: 49149.71 RPS
Single-stepped STP: 46750.82 RPS
Emulated STP: 48981.19 RPS

Redis SET (higher is better)
----------------------------
No uprobe: 49761.14 RPS
Single-stepped STP: 45255.01 RPS
Emulated STP: 48619.21 RPS

Grep (lower is better)
----------------------
No uprobe: 2.165s
Single-stepped STP: 15.314s
Emulated STP: 2.216s

Additionally, profiling the entry instruction of all leaf and non-leaf
functions shows that the ratio of 'stp fp, lr, [sp, #imm]' is larger than
50%. So simulating the STP at function entry is a more viable option
for uprobes.

The first version [1] used a uaccess routine to simulate the STP that
pushes fp/lr onto the stack, using a pair of STTR instructions for the
memory store. But as Mark pointed out, this approach can't reproduce the
correct single-copy atomicity and ordering properties of STP, especially
when it interacts with MTE, POE, etc. So this patch uses a more complex
and less efficient approach that acquires the user stack pages, maps
them into kernel address space, and lets the kernel use STP directly to
push fp/lr into the stack pages.
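
For reference, the architectural effect being reproduced for the
pre-index 'stp x29, x30, [sp, #imm]!' is roughly the following
(pseudo-C only; the real handler performs the store with an actual STP
on a kernel mapping of the user stack pages to keep the single-copy
atomicity of the access):

	/* imm7 is the 7-bit immediate decoded from the opcode, scaled by 8 */
	u64 new_sp = regs->sp + sign_extend64(imm7, 6) * 8;
	*(u64 *)(new_sp + 0) = regs->regs[29];	/* fp */
	*(u64 *)(new_sp + 8) = regs->regs[30];	/* lr */
	regs->sp = new_sp;			/* pre-index write-back */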

xol-stp
-------
uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)

simulated-stp
-------------
uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)

The profiling results, based on the upstream kernel with the spinlock
optimization patches [2], show that simulating STP increases uprobe-push
throughput by about 30% (from 0.868M/s/cpu to 1.128M/s/cpu) and
uretprobe-push throughput by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
[2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/include/asm/insn.h            |  1 +
 arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
 arch/arm64/kernel/probes/decode-insn.h   |  1 +
 arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
 arch/arm64/kernel/probes/simulate-insn.h |  1 +
 arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
 arch/arm64/lib/insn.c                    |  5 ++
 7 files changed, 134 insertions(+)

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index dd530d5c3d67..74e25debfa75 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -561,6 +561,7 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
 				  u32 insn, u64 imm);
 u32 aarch64_insn_decode_register(enum aarch64_insn_register_type type,
 					 u32 insn);
+u32 aarch64_insn_decode_ldst_size(u32 insn);
 u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr,
 				enum aarch64_insn_branch_type type);
 u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr,
diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
index be54539e309e..847a7a61ff6d 100644
--- a/arch/arm64/kernel/probes/decode-insn.c
+++ b/arch/arm64/kernel/probes/decode-insn.c
@@ -67,6 +67,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
 	return true;
 }
 
+bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn)
+{
+	/*
+	 * The 1st instruction on function entry often follows the
+	 * pattern 'stp x29, x30, [sp, #imm]!', pushing fp and lr
+	 * onto the stack.
+	 */
+	u32 opc = aarch64_insn_decode_ldst_size(insn);
+	u32 rt2 = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn);
+	u32 rn = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn);
+	u32 rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn);
+
+	return aarch64_insn_is_stp_pre(insn) &&
+	       (opc == 2) && (rt2 == 30) && (rn == 31) && (rt == 29);
+}
+
 /* Return:
  *   INSN_REJECTED     If instruction is one not allowed to kprobe,
  *   INSN_GOOD         If instruction is supported and uses instruction slot,
diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
index 8b758c5a2062..033ccab73da6 100644
--- a/arch/arm64/kernel/probes/decode-insn.h
+++ b/arch/arm64/kernel/probes/decode-insn.h
@@ -29,5 +29,6 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
 #endif
 enum probe_insn __kprobes
 arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
+bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn);
 
 #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
index 5e4f887a074c..3906851c07b2 100644
--- a/arch/arm64/kernel/probes/simulate-insn.c
+++ b/arch/arm64/kernel/probes/simulate-insn.c
@@ -8,6 +8,9 @@
 #include <linux/bitops.h>
 #include <linux/kernel.h>
 #include <linux/kprobes.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
 
 #include <asm/ptrace.h>
 #include <asm/traps.h>
@@ -211,3 +214,89 @@ simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
 	 */
 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
 }
+
+static inline
+bool stack_align_check(unsigned long sp)
+{
+	return (IS_ALIGNED(sp, 16) ||
+		!(read_sysreg(sctlr_el1) & SCTLR_EL1_SA0_MASK));
+}
+
+static inline
+void put_user_stack_pages(struct page **pages, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		put_page(pages[i]);
+}
+
+static inline
+int get_user_stack_pages(long start, long end, struct page **pages)
+{
+	int ret;
+	int nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
+
+	ret = get_user_pages_fast(start, nr_pages,
+				  FOLL_WRITE | FOLL_FORCE, pages);
+	if (unlikely(ret != nr_pages)) {
+		if (ret > 0)
+			put_user_stack_pages(pages, ret);
+		return 0;
+	}
+
+	return nr_pages;
+}
+
+static inline
+void *map_user_stack_pages(struct page **pages, int nr_pages)
+{
+	if (likely(nr_pages == 1))
+		return kmap_local_page(pages[0]);
+	else
+		return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+}
+
+static inline
+void unmap_user_stack_pages(void *kaddr, int nr_pages)
+{
+	if (likely(nr_pages == 1))
+		kunmap_local(kaddr);
+	else
+		vunmap(kaddr);
+}
+
+void __kprobes
+simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
+{
+	long imm7;
+	long new_sp;
+	int nr_pages;
+	void *kaddr, *dst;
+	struct page *pages[2] = { NULL };
+
+	imm7 = aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode);
+	new_sp = regs->sp + (sign_extend64(imm7, 6) << 3);
+	if (!stack_align_check(new_sp)) {
+		force_sig(SIGSEGV);
+		goto done;
+	}
+
+	nr_pages = get_user_stack_pages(new_sp, regs->sp, pages);
+	if (!nr_pages) {
+		force_sig(SIGSEGV);
+		goto done;
+	}
+
+	kaddr = map_user_stack_pages(pages, nr_pages);
+	dst = kaddr + (new_sp & ~PAGE_MASK);
+	asm volatile("stp %0, %1, [%2]"
+		     : : "r"(regs->regs[29]), "r"(regs->regs[30]), "r"(dst));
+
+	unmap_user_stack_pages(kaddr, nr_pages);
+	put_user_stack_pages(pages, nr_pages);
+
+done:
+	regs->sp = new_sp;
+	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+}
diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
index efb2803ec943..733a47ffa2e5 100644
--- a/arch/arm64/kernel/probes/simulate-insn.h
+++ b/arch/arm64/kernel/probes/simulate-insn.h
@@ -17,5 +17,6 @@ void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
+void simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs);
 
 #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cd..c70862314fde 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -8,6 +8,7 @@
 #include <asm/cacheflush.h>
 
 #include "decode-insn.h"
+#include "simulate-insn.h"
 
 #define UPROBE_INV_FAULT_CODE	UINT_MAX
 
@@ -31,6 +32,21 @@ unsigned long uprobe_get_swbp_addr(struct pt_regs *regs)
 	return instruction_pointer(regs);
 }
 
+static enum probe_insn
+arm_uprobe_decode_special_insn(probe_opcode_t insn, struct arch_probe_insn *api)
+{
+	/*
+	 * When uprobes would interact with VMSA features such as MTE, POE,
+	 * etc., give up simulating memory access related instructions.
+	 */
+	if (!system_supports_mte() && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
+		api->handler = simulate_stp_fp_lr_sp_64b;
+		return INSN_GOOD_NO_SLOT;
+	}
+
+	return INSN_REJECTED;
+}
+
 int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
 		unsigned long addr)
 {
@@ -44,6 +60,11 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	insn = *(probe_opcode_t *)(&auprobe->insn[0]);
 
+	if (arm_uprobe_decode_special_insn(insn, &auprobe->api)) {
+		auprobe->simulate = true;
+		return 0;
+	}
+
 	switch (arm_probe_decode_insn(insn, &auprobe->api)) {
 	case INSN_REJECTED:
 		return -EINVAL;
diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c
index b008a9b46a7f..0635219d2196 100644
--- a/arch/arm64/lib/insn.c
+++ b/arch/arm64/lib/insn.c
@@ -238,6 +238,11 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type,
 	return insn;
 }
 
+u32 aarch64_insn_decode_ldst_size(u32 insn)
+{
+	return (insn & GENMASK(31, 30)) >> 30;
+}
+
 static inline long label_imm_common(unsigned long pc, unsigned long addr,
 				     long range)
 {
-- 
2.34.1

Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Mark Rutland 1 month ago
On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> This patch is the second part of a series to improve the selftest bench
> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> significantly impact uprobe/uretprobe performance at function entry in
> most user cases. Profiling results below reveals the STP that executes
> in the xol slot and trap back to kernel, reduce redis RPS and increase
> the time of string grep obviously.
> 
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> 
> Redis GET (higher is better)
> ----------------------------
> No uprobe: 49149.71 RPS
> Single-stepped STP: 46750.82 RPS
> Emulated STP: 48981.19 RPS
> 
> Redis SET (larger is better)
> ----------------------------
> No uprobe: 49761.14 RPS
> Single-stepped STP: 45255.01 RPS
> Emulated stp: 48619.21 RPS
> 
> Grep (lower is better)
> ----------------------
> No uprobe: 2.165s
> Single-stepped STP: 15.314s
> Emualted STP: 2.216s

The results for grep are concerning.

In theory, the overhead for stepping should be roughly double the
overhead for emulating, assuming the exception-entry and
exception-return are the dominant cost. The cost of stepping should be
trivial.

Those results show emulating adds 0.051s (for a ~2.4% overhead), while
stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
more expensive.

Was this tested bare-metal, or in a VM?

AFAICT either:

* Single-stepping is unexpectedly expensive.
 
  Historically we had performance issues with hypervisor trapping of
  debug features, and there are things we might be able to improve in
  the hypervisor and kernel, which would improve stepping *all*
  instructions.
  
  If stepping is the big problem, we could move uprobes over to a BRK
  rather than a single-step. That would require updating and
  fixing the logic to decide which instructions are steppable, but
  that's necessary anyway given it has extant soundness issues.

* XOL management is absurdly expensive.
 
  Does uprobes keep the XOL slot around (like kprobes does), or does it
  create the slot afresh for each trap?

  If that's trying to create a slot afresh for each trap, there are
  several opportunities for improvement, e.g. keep the slot around for
  as long as the uprobe exists, or pre-allocate shared slots for common
  instructions and use those.

Mark.

> 
> Additionally, a profiling of the entry instruction for all leaf and
> non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than
> 50%. So simulting the STP on the function entry is a more viable option
> for uprobe.
> 
> In the first version [1], it used a uaccess routine to simulate the STP
> that push fp/lr into stack, which use double STTR instructions for
> memory store. But as Mark pointed out, this approach can't simulate the
> correct single-atomicity and ordering properties of STP, especiallly
> when it interacts with MTE, POE, etc. So this patch uses a more complex
> and inefficient approach that acquires user stack pages, maps them to
> kernel address space, and allows kernel to use STP directly push fp/lr
> into the stack pages.
> 
> xol-stp
> -------
> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
> 
> simulated-stp
> -------------
> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
> 
> The profiling results based on the upstream kernel with spinlock
> optimization patches [2] reveals the simulation of STP increase the
> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> 
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
> 
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  arch/arm64/include/asm/insn.h            |  1 +
>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>  arch/arm64/lib/insn.c                    |  5 ++
>  7 files changed, 134 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index dd530d5c3d67..74e25debfa75 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -561,6 +561,7 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>  				  u32 insn, u64 imm);
>  u32 aarch64_insn_decode_register(enum aarch64_insn_register_type type,
>  					 u32 insn);
> +u32 aarch64_insn_decode_ldst_size(u32 insn);
>  u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr,
>  				enum aarch64_insn_branch_type type);
>  u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr,
> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> index be54539e309e..847a7a61ff6d 100644
> --- a/arch/arm64/kernel/probes/decode-insn.c
> +++ b/arch/arm64/kernel/probes/decode-insn.c
> @@ -67,6 +67,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>  	return true;
>  }
>  
> +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn)
> +{
> +	/*
> +	 * The 1st instruction on function entry often follows the
> +	 * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
> +	 * into stack.
> +	 */
> +	u32 opc = aarch64_insn_decode_ldst_size(insn);
> +	u32 rt2 = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn);
> +	u32 rn = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn);
> +	u32 rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn);
> +
> +	return aarch64_insn_is_stp_pre(insn) &&
> +	       (opc == 2) && (rt2 == 30) && (rn == 31) && (rt == 29);
> +}
> +
>  /* Return:
>   *   INSN_REJECTED     If instruction is one not allowed to kprobe,
>   *   INSN_GOOD         If instruction is supported and uses instruction slot,
> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
> index 8b758c5a2062..033ccab73da6 100644
> --- a/arch/arm64/kernel/probes/decode-insn.h
> +++ b/arch/arm64/kernel/probes/decode-insn.h
> @@ -29,5 +29,6 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>  #endif
>  enum probe_insn __kprobes
>  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
> +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn);
>  
>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> index 5e4f887a074c..3906851c07b2 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.c
> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> @@ -8,6 +8,9 @@
>  #include <linux/bitops.h>
>  #include <linux/kernel.h>
>  #include <linux/kprobes.h>
> +#include <linux/highmem.h>
> +#include <linux/vmalloc.h>
> +#include <linux/mm.h>
>  
>  #include <asm/ptrace.h>
>  #include <asm/traps.h>
> @@ -211,3 +214,89 @@ simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>  	 */
>  	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>  }
> +
> +static inline
> +bool stack_align_check(unsigned long sp)
> +{
> +	return (IS_ALIGNED(sp, 16) ||
> +		!(read_sysreg(sctlr_el1) & SCTLR_EL1_SA0_MASK));
> +}
> +
> +static inline
> +void put_user_stack_pages(struct page **pages, int nr_pages)
> +{
> +	int i;
> +
> +	for (i = 0; i < nr_pages; i++)
> +		put_page(pages[i]);
> +}
> +
> +static inline
> +int get_user_stack_pages(long start, long end, struct page **pages)
> +{
> +	int ret;
> +	int nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
> +
> +	ret = get_user_pages_fast(start, nr_pages,
> +				  FOLL_WRITE | FOLL_FORCE, pages);
> +	if (unlikely(ret != nr_pages)) {
> +		if (ret > 0)
> +			put_user_stack_pages(pages, ret);
> +		return 0;
> +	}
> +
> +	return nr_pages;
> +}
> +
> +static inline
> +void *map_user_stack_pages(struct page **pages, int nr_pages)
> +{
> +	if (likely(nr_pages == 1))
> +		return kmap_local_page(pages[0]);
> +	else
> +		return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
> +}
> +
> +static inline
> +void unmap_user_stack_pages(void *kaddr, int nr_pages)
> +{
> +	if (likely(nr_pages == 1))
> +		kunmap_local(kaddr);
> +	else
> +		vunmap(kaddr);
> +}
> +
> +void __kprobes
> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
> +{
> +	long imm7;
> +	long new_sp;
> +	int nr_pages;
> +	void *kaddr, *dst;
> +	struct page *pages[2] = { NULL };
> +
> +	imm7 = aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode);
> +	new_sp = regs->sp + (sign_extend64(imm7, 6) << 3);
> +	if (!stack_align_check(new_sp)) {
> +		force_sig(SIGSEGV);
> +		goto done;
> +	}
> +
> +	nr_pages = get_user_stack_pages(new_sp, regs->sp, pages);
> +	if (!nr_pages) {
> +		force_sig(SIGSEGV);
> +		goto done;
> +	}
> +
> +	kaddr = map_user_stack_pages(pages, nr_pages);
> +	dst = kaddr + (new_sp & ~PAGE_MASK);
> +	asm volatile("stp %0, %1, [%2]"
> +		     : : "r"(regs->regs[29]), "r"(regs->regs[30]), "r"(dst));
> +
> +	unmap_user_stack_pages(kaddr, nr_pages);
> +	put_user_stack_pages(pages, nr_pages);
> +
> +done:
> +	regs->sp = new_sp;
> +	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> +}
> diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
> index efb2803ec943..733a47ffa2e5 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.h
> +++ b/arch/arm64/kernel/probes/simulate-insn.h
> @@ -17,5 +17,6 @@ void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
> +void simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs);
>  
>  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
> diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
> index d49aef2657cd..c70862314fde 100644
> --- a/arch/arm64/kernel/probes/uprobes.c
> +++ b/arch/arm64/kernel/probes/uprobes.c
> @@ -8,6 +8,7 @@
>  #include <asm/cacheflush.h>
>  
>  #include "decode-insn.h"
> +#include "simulate-insn.h"
>  
>  #define UPROBE_INV_FAULT_CODE	UINT_MAX
>  
> @@ -31,6 +32,21 @@ unsigned long uprobe_get_swbp_addr(struct pt_regs *regs)
>  	return instruction_pointer(regs);
>  }
>  
> +static enum probe_insn
> +arm_uprobe_decode_special_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> +{
> +	/*
> +	 * When uprobe interact with VMSA features, such as MTE, POE, etc, it
> +	 * give up the simulation of memory access related instructions.
> +	 */
> +	if (!system_supports_mte() && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
> +		api->handler = simulate_stp_fp_lr_sp_64b;
> +		return INSN_GOOD_NO_SLOT;
> +	}
> +
> +	return INSN_REJECTED;
> +}
> +
>  int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  		unsigned long addr)
>  {
> @@ -44,6 +60,11 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  
>  	insn = *(probe_opcode_t *)(&auprobe->insn[0]);
>  
> +	if (arm_uprobe_decode_special_insn(insn, &auprobe->api)) {
> +		auprobe->simulate = true;
> +		return 0;
> +	}
> +
>  	switch (arm_probe_decode_insn(insn, &auprobe->api)) {
>  	case INSN_REJECTED:
>  		return -EINVAL;
> diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c
> index b008a9b46a7f..0635219d2196 100644
> --- a/arch/arm64/lib/insn.c
> +++ b/arch/arm64/lib/insn.c
> @@ -238,6 +238,11 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type,
>  	return insn;
>  }
>  
> +u32 aarch64_insn_decode_ldst_size(u32 insn)
> +{
> +	return (insn & GENMASK(31, 30)) >> 30;
> +}
> +
>  static inline long label_imm_common(unsigned long pc, unsigned long addr,
>  				     long range)
>  {
> -- 
> 2.34.1
> 
Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Liao, Chang 1 month ago

On 2024/10/24 22:06, Mark Rutland wrote:
> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
>> This patch is the second part of a series to improve the selftest bench
>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
>> significantly impact uprobe/uretprobe performance at function entry in
>> most user cases. Profiling results below reveals the STP that executes
>> in the xol slot and trap back to kernel, reduce redis RPS and increase
>> the time of string grep obviously.
>>
>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>
>> Redis GET (higher is better)
>> ----------------------------
>> No uprobe: 49149.71 RPS
>> Single-stepped STP: 46750.82 RPS
>> Emulated STP: 48981.19 RPS
>>
>> Redis SET (larger is better)
>> ----------------------------
>> No uprobe: 49761.14 RPS
>> Single-stepped STP: 45255.01 RPS
>> Emulated stp: 48619.21 RPS
>>
>> Grep (lower is better)
>> ----------------------
>> No uprobe: 2.165s
>> Single-stepped STP: 15.314s
>> Emualted STP: 2.216s
> 
> The results for grep are concerning.
> 
> In theory, the overhead for stepping should be roughly double the
> overhead for emulating, assuming the exception-entry and
> exception-return are the dominant cost. The cost of stepping should be
> trivial.
> 
> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
> more expensive.

To provide more context, I set an uprobe at the grep_file() function from
the grep source code [1], which is likely on a very hot path. I suspect
that the high frequency of uprobe traps contributes to the significant
overhead. I'm not entirely certain about this, and I'll need to
investigate further to pinpoint the source of the 13.149s.

[1] https://github.com/brgl/busybox/blob/master/findutils/grep.c#L302

> 
> Was this tested bare-metal, or in a VM?

bare-metal.

> 
> AFAICT either:
> 
> * Single-stepping is unexpectedly expensive.
>  
>   Historically we had performance issues with hypervisor trapping of
>   debug features, and there are things we might be able to improve in
>   the hypervisor and kernel, which would improve stepping *all*
>   instructions.
>   
>   If stepping is the big problem, we could move uprobes over to a BRK
>   rather than a single-step. That would require require updating and
>   fixing the logic to decide which instructions are steppable, but
>   that's necessary anyway given it has extant soundness issues.

That's very helpful input. I'll rerun the profiling in a VM.

> 
> * XOL management is absurdly expensive.
>  
>   Does uprobes keep the XOL slot around (like krpobes does), or does it
>   create the slot afresh for each trap?

As far as I know, it doesn't create a new slot for each trap. Instead, it
grabs a slot from a pre-allocated pool. I believe the cost of acquiring a
slot from this pool is relatively low compared to creating a new one.
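
Roughly speaking, grabbing a slot is a bitmap search over a per-mm page
of fixed-size slots. A simplified sketch of the idea (the struct, helper
name and 128-byte slot size here are illustrative, not the exact code in
kernel/events/uprobes.c):

	struct xol_area_sketch {
		unsigned long	vaddr;	/* user VA of the XOL page */
		unsigned long	bitmap[BITS_TO_LONGS(PAGE_SIZE / 128)];
	};

	static unsigned long xol_take_slot_sketch(struct xol_area_sketch *area)
	{
		unsigned long nr_slots = PAGE_SIZE / 128;
		unsigned long slot_nr;

		do {
			slot_nr = find_first_zero_bit(area->bitmap, nr_slots);
			if (slot_nr == nr_slots)
				return 0;	/* pool exhausted, caller has to wait */
		} while (test_and_set_bit(slot_nr, area->bitmap));

		return area->vaddr + slot_nr * 128;
	}

So per trap it is mostly an atomic bitmap operation (plus copying the
instruction into the slot), rather than building a new mapping.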

> 
>   If that's trying to create a slot afresh for each trap, there are
>   several opportunities for improvement, e.g. keep the slot around for
>   as long as the uprobe exists, or pre-allocate shared slots for common
>   instructions and use those.

I have sent a patch [2] aimed at improving the scalability of uprobe
handling. While that patch focuses on reducing the cacheline bouncing
caused by atomic operations when grabbing and releasing a slot from the
pool, I wouldn't say it has an impact on grep, as grep typically uses a
single thread for pattern matching on my system.

[2] https://lore.kernel.org/all/20240927094549.3382916-1-liaochang1@huawei.com/

> 
> Mark.
> 
>>
>> Additionally, a profiling of the entry instruction for all leaf and
>> non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than
>> 50%. So simulting the STP on the function entry is a more viable option
>> for uprobe.
>>
>> In the first version [1], it used a uaccess routine to simulate the STP
>> that push fp/lr into stack, which use double STTR instructions for
>> memory store. But as Mark pointed out, this approach can't simulate the
>> correct single-atomicity and ordering properties of STP, especiallly
>> when it interacts with MTE, POE, etc. So this patch uses a more complex
>> and inefficient approach that acquires user stack pages, maps them to
>> kernel address space, and allows kernel to use STP directly push fp/lr
>> into the stack pages.
>>
>> xol-stp
>> -------
>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>>
>> simulated-stp
>> -------------
>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>>
>> The profiling results based on the upstream kernel with spinlock
>> optimization patches [2] reveals the simulation of STP increase the
>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>>
>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
>>
>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>> ---
>>  arch/arm64/include/asm/insn.h            |  1 +
>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>>  arch/arm64/lib/insn.c                    |  5 ++
>>  7 files changed, 134 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>> index dd530d5c3d67..74e25debfa75 100644
>> --- a/arch/arm64/include/asm/insn.h
>> +++ b/arch/arm64/include/asm/insn.h
>> @@ -561,6 +561,7 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>  				  u32 insn, u64 imm);
>>  u32 aarch64_insn_decode_register(enum aarch64_insn_register_type type,
>>  					 u32 insn);
>> +u32 aarch64_insn_decode_ldst_size(u32 insn);
>>  u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr,
>>  				enum aarch64_insn_branch_type type);
>>  u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr,
>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>> index be54539e309e..847a7a61ff6d 100644
>> --- a/arch/arm64/kernel/probes/decode-insn.c
>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>> @@ -67,6 +67,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>  	return true;
>>  }
>>  
>> +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn)
>> +{
>> +	/*
>> +	 * The 1st instruction on function entry often follows the
>> +	 * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
>> +	 * into stack.
>> +	 */
>> +	u32 opc = aarch64_insn_decode_ldst_size(insn);
>> +	u32 rt2 = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn);
>> +	u32 rn = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn);
>> +	u32 rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn);
>> +
>> +	return aarch64_insn_is_stp_pre(insn) &&
>> +	       (opc == 2) && (rt2 == 30) && (rn == 31) && (rt == 29);
>> +}
>> +
>>  /* Return:
>>   *   INSN_REJECTED     If instruction is one not allowed to kprobe,
>>   *   INSN_GOOD         If instruction is supported and uses instruction slot,
>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>> index 8b758c5a2062..033ccab73da6 100644
>> --- a/arch/arm64/kernel/probes/decode-insn.h
>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>> @@ -29,5 +29,6 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>  #endif
>>  enum probe_insn __kprobes
>>  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>> +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn);
>>  
>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>> index 5e4f887a074c..3906851c07b2 100644
>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>> @@ -8,6 +8,9 @@
>>  #include <linux/bitops.h>
>>  #include <linux/kernel.h>
>>  #include <linux/kprobes.h>
>> +#include <linux/highmem.h>
>> +#include <linux/vmalloc.h>
>> +#include <linux/mm.h>
>>  
>>  #include <asm/ptrace.h>
>>  #include <asm/traps.h>
>> @@ -211,3 +214,89 @@ simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>  	 */
>>  	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>>  }
>> +
>> +static inline
>> +bool stack_align_check(unsigned long sp)
>> +{
>> +	return (IS_ALIGNED(sp, 16) ||
>> +		!(read_sysreg(sctlr_el1) & SCTLR_EL1_SA0_MASK));
>> +}
>> +
>> +static inline
>> +void put_user_stack_pages(struct page **pages, int nr_pages)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < nr_pages; i++)
>> +		put_page(pages[i]);
>> +}
>> +
>> +static inline
>> +int get_user_stack_pages(long start, long end, struct page **pages)
>> +{
>> +	int ret;
>> +	int nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
>> +
>> +	ret = get_user_pages_fast(start, nr_pages,
>> +				  FOLL_WRITE | FOLL_FORCE, pages);
>> +	if (unlikely(ret != nr_pages)) {
>> +		if (ret > 0)
>> +			put_user_stack_pages(pages, ret);
>> +		return 0;
>> +	}
>> +
>> +	return nr_pages;
>> +}
>> +
>> +static inline
>> +void *map_user_stack_pages(struct page **pages, int nr_pages)
>> +{
>> +	if (likely(nr_pages == 1))
>> +		return kmap_local_page(pages[0]);
>> +	else
>> +		return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
>> +}
>> +
>> +static inline
>> +void unmap_user_stack_pages(void *kaddr, int nr_pages)
>> +{
>> +	if (likely(nr_pages == 1))
>> +		kunmap_local(kaddr);
>> +	else
>> +		vunmap(kaddr);
>> +}
>> +
>> +void __kprobes
>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>> +{
>> +	long imm7;
>> +	long new_sp;
>> +	int nr_pages;
>> +	void *kaddr, *dst;
>> +	struct page *pages[2] = { NULL };
>> +
>> +	imm7 = aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode);
>> +	new_sp = regs->sp + (sign_extend64(imm7, 6) << 3);
>> +	if (!stack_align_check(new_sp)) {
>> +		force_sig(SIGSEGV);
>> +		goto done;
>> +	}
>> +
>> +	nr_pages = get_user_stack_pages(new_sp, regs->sp, pages);
>> +	if (!nr_pages) {
>> +		force_sig(SIGSEGV);
>> +		goto done;
>> +	}
>> +
>> +	kaddr = map_user_stack_pages(pages, nr_pages);
>> +	dst = kaddr + (new_sp & ~PAGE_MASK);
>> +	asm volatile("stp %0, %1, [%2]"
>> +		     : : "r"(regs->regs[29]), "r"(regs->regs[30]), "r"(dst));
>> +
>> +	unmap_user_stack_pages(kaddr, nr_pages);
>> +	put_user_stack_pages(pages, nr_pages);
>> +
>> +done:
>> +	regs->sp = new_sp;
>> +	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>> +}
>> diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
>> index efb2803ec943..733a47ffa2e5 100644
>> --- a/arch/arm64/kernel/probes/simulate-insn.h
>> +++ b/arch/arm64/kernel/probes/simulate-insn.h
>> @@ -17,5 +17,6 @@ void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
>>  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
>>  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
>>  void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
>> +void simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs);
>>  
>>  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
>> diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
>> index d49aef2657cd..c70862314fde 100644
>> --- a/arch/arm64/kernel/probes/uprobes.c
>> +++ b/arch/arm64/kernel/probes/uprobes.c
>> @@ -8,6 +8,7 @@
>>  #include <asm/cacheflush.h>
>>  
>>  #include "decode-insn.h"
>> +#include "simulate-insn.h"
>>  
>>  #define UPROBE_INV_FAULT_CODE	UINT_MAX
>>  
>> @@ -31,6 +32,21 @@ unsigned long uprobe_get_swbp_addr(struct pt_regs *regs)
>>  	return instruction_pointer(regs);
>>  }
>>  
>> +static enum probe_insn
>> +arm_uprobe_decode_special_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>> +{
>> +	/*
>> +	 * When uprobe interact with VMSA features, such as MTE, POE, etc, it
>> +	 * give up the simulation of memory access related instructions.
>> +	 */
>> +	if (!system_supports_mte() && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>> +		api->handler = simulate_stp_fp_lr_sp_64b;
>> +		return INSN_GOOD_NO_SLOT;
>> +	}
>> +
>> +	return INSN_REJECTED;
>> +}
>> +
>>  int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>  		unsigned long addr)
>>  {
>> @@ -44,6 +60,11 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>  
>>  	insn = *(probe_opcode_t *)(&auprobe->insn[0]);
>>  
>> +	if (arm_uprobe_decode_special_insn(insn, &auprobe->api)) {
>> +		auprobe->simulate = true;
>> +		return 0;
>> +	}
>> +
>>  	switch (arm_probe_decode_insn(insn, &auprobe->api)) {
>>  	case INSN_REJECTED:
>>  		return -EINVAL;
>> diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c
>> index b008a9b46a7f..0635219d2196 100644
>> --- a/arch/arm64/lib/insn.c
>> +++ b/arch/arm64/lib/insn.c
>> @@ -238,6 +238,11 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type,
>>  	return insn;
>>  }
>>  
>> +u32 aarch64_insn_decode_ldst_size(u32 insn)
>> +{
>> +	return (insn & GENMASK(31, 30)) >> 30;
>> +}
>> +
>>  static inline long label_imm_common(unsigned long pc, unsigned long addr,
>>  				     long range)
>>  {
>> -- 
>> 2.34.1
>>
> 

-- 
BR
Liao, Chang

Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Andrii Nakryiko 1 month ago
On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> > This patch is the second part of a series to improve the selftest bench
> > of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> > significantly impact uprobe/uretprobe performance at function entry in
> > most user cases. Profiling results below reveals the STP that executes
> > in the xol slot and trap back to kernel, reduce redis RPS and increase
> > the time of string grep obviously.
> >
> > On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >
> > Redis GET (higher is better)
> > ----------------------------
> > No uprobe: 49149.71 RPS
> > Single-stepped STP: 46750.82 RPS
> > Emulated STP: 48981.19 RPS
> >
> > Redis SET (larger is better)
> > ----------------------------
> > No uprobe: 49761.14 RPS
> > Single-stepped STP: 45255.01 RPS
> > Emulated stp: 48619.21 RPS
> >
> > Grep (lower is better)
> > ----------------------
> > No uprobe: 2.165s
> > Single-stepped STP: 15.314s
> > Emualted STP: 2.216s
>
> The results for grep are concerning.
>
> In theory, the overhead for stepping should be roughly double the
> overhead for emulating, assuming the exception-entry and
> exception-return are the dominant cost. The cost of stepping should be
> trivial.
>
> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
> more expensive.
>
> Was this tested bare-metal, or in a VM?

Hey Mark, I hope Liao will have a chance to reply, I don't know the
details of his benchmarking. But I can try to give you my numbers and
maybe answer a few questions, hopefully that helps move the
conversation forward.

So, first of all, I did a quick benchmark on bare metal (without
Liao's optimization, though), here are my results:

uprobe-nop            ( 1 cpus):    2.334 ± 0.011M/s  (  2.334M/s/cpu)
uprobe-push           ( 1 cpus):    2.321 ± 0.010M/s  (  2.321M/s/cpu)
uprobe-ret            ( 1 cpus):    4.144 ± 0.041M/s  (  4.144M/s/cpu)

uretprobe-nop         ( 1 cpus):    1.684 ± 0.004M/s  (  1.684M/s/cpu)
uretprobe-push        ( 1 cpus):    1.736 ± 0.003M/s  (  1.736M/s/cpu)
uretprobe-ret         ( 1 cpus):    2.502 ± 0.006M/s  (  2.502M/s/cpu)

uretprobes are inherently slower, so I'll just compare uprobe, as the
differences are very clear either way.

-nop is literally nop (Liao solved that issue, I just don't have his
patch applied on my test machine). -push has `stp     x29, x30, [sp,
#-0x10]!` instruction traced. -ret is literally just `ret`
instruction.

So you can see that -ret is almost twice as fast as the -push variant
(it's a microbenchmark, yes, but still).

>
> AFAICT either:
>
> * Single-stepping is unexpectedly expensive.
>
>   Historically we had performance issues with hypervisor trapping of
>   debug features, and there are things we might be able to improve in
>   the hypervisor and kernel, which would improve stepping *all*
>   instructions.
>

Single-stepping will always be more expensive, as it necessitates an
extra kernel->user space->kernel hop, so no matter the optimization
for single-stepping, if we can avoid it, we should. It will be
noticeable.

>   If stepping is the big problem, we could move uprobes over to a BRK
>   rather than a single-step. That would require require updating and
>   fixing the logic to decide which instructions are steppable, but
>   that's necessary anyway given it has extant soundness issues.

I'm afraid I don't understand what BRK means and what the consequences
are in terms of overhead. I'm not an ARM person either, so sorry if
that's a stupid question. But either way, I can't address this. But see
above: emulating an instruction feels like a much better approach, if
possible.

>
> * XOL management is absurdly expensive.
>
>   Does uprobes keep the XOL slot around (like krpobes does), or does it
>   create the slot afresh for each trap?

XOL *page* is created once per process, lazily, and then we just
juggle a bunch of fixed slots there for each instance of
single-stepped uprobe. And yes, there are some bottlenecks in XOL
management, though it's mostly due to lock contention (as it is
implemented right now). Liao and Oleg have been improving XOL
management, but still, avoiding XOL in the first place is the much
preferred way.

>
>   If that's trying to create a slot afresh for each trap, there are
>   several opportunities for improvement, e.g. keep the slot around for
>   as long as the uprobe exists, or pre-allocate shared slots for common
>   instructions and use those.

As I mentioned, a XOL page is allocated and mapped once, but yes, it
seems like we dynamically get a slot in it for each single-stepped
execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
probably not a bad idea to just cache and hold a XOL slot for each
specific uprobe, I don't see why we should limit ourselves to just one
XOL page. We also don't need to pre-size each slot, we can probably
allocate just the right amount of space for a given uprobe.

All good ideas for sure, we should do them, IMO. But we'll still be
paying an extra kernel->user->kernel switch, which is almost certainly
slower than doing a simple stack push emulation just like we do in the
x86-64 case, no?
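
Conceptually (this is not the literal x86 code, just the shape of it,
and the helper name is made up), the push emulation there amounts to a
copy_to_user() of the value plus an adjustment of regs->sp:

	static int emulate_push_sketch(struct pt_regs *regs, unsigned long val)
	{
		unsigned long new_sp = regs->sp - sizeof(unsigned long);

		if (copy_to_user((void __user *)new_sp, &val, sizeof(val)))
			return -EFAULT;		/* caller raises SIGSEGV */

		regs->sp = new_sp;
		return 0;
	}

No exception return to user space, no XOL slot, just a plain uaccess
write.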


BTW, I did a quick local profiling run. I don't think XOL management
is the main source of overhead. I see 5% of CPU cycles spent in
arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
traces. There are at least 22% CPU cycles spent in some
local_daif_restore function, though, not sure what that is, but might
be related to interrupt handling, right?


The takeaway I'd like to communicate here is that avoiding the need for
single-stepping is *the best way* to go, IMO. So if we can emulate those
STP instructions for uprobes *cheaply*, that would be awesome.

>
> Mark.
>
> >
> > Additionally, a profiling of the entry instruction for all leaf and
> > non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than
> > 50%. So simulting the STP on the function entry is a more viable option
> > for uprobe.
> >
> > In the first version [1], it used a uaccess routine to simulate the STP
> > that push fp/lr into stack, which use double STTR instructions for
> > memory store. But as Mark pointed out, this approach can't simulate the
> > correct single-atomicity and ordering properties of STP, especiallly
> > when it interacts with MTE, POE, etc. So this patch uses a more complex
> > and inefficient approach that acquires user stack pages, maps them to
> > kernel address space, and allows kernel to use STP directly push fp/lr
> > into the stack pages.
> >
> > xol-stp
> > -------
> > uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> > uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> > uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> > uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> > uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> > uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
> >
> > simulated-stp
> > -------------
> > uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> > uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> > uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> > uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> > uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> > uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
> >
> > The profiling results based on the upstream kernel with spinlock
> > optimization patches [2] reveals the simulation of STP increase the
> > uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
> > uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> >
> > [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> > [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> > [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
> >
> > Signed-off-by: Liao Chang <liaochang1@huawei.com>
> > ---
> >  arch/arm64/include/asm/insn.h            |  1 +
> >  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
> >  arch/arm64/kernel/probes/decode-insn.h   |  1 +
> >  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >  arch/arm64/kernel/probes/simulate-insn.h |  1 +
> >  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
> >  arch/arm64/lib/insn.c                    |  5 ++
> >  7 files changed, 134 insertions(+)
> >

[...]
Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Liao, Chang 3 weeks, 3 days ago
Andrii and Mark.

On 2024/10/26 4:51, Andrii Nakryiko wrote:
> On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
>>
>> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
>>> This patch is the second part of a series to improve the selftest bench
>>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
>>> significantly impact uprobe/uretprobe performance at function entry in
>>> most user cases. Profiling results below reveals the STP that executes
>>> in the xol slot and trap back to kernel, reduce redis RPS and increase
>>> the time of string grep obviously.
>>>
>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>
>>> Redis GET (higher is better)
>>> ----------------------------
>>> No uprobe: 49149.71 RPS
>>> Single-stepped STP: 46750.82 RPS
>>> Emulated STP: 48981.19 RPS
>>>
>>> Redis SET (larger is better)
>>> ----------------------------
>>> No uprobe: 49761.14 RPS
>>> Single-stepped STP: 45255.01 RPS
>>> Emulated stp: 48619.21 RPS
>>>
>>> Grep (lower is better)
>>> ----------------------
>>> No uprobe: 2.165s
>>> Single-stepped STP: 15.314s
>>> Emualted STP: 2.216s
>>
>> The results for grep are concerning.
>>
>> In theory, the overhead for stepping should be roughly double the
>> overhead for emulating, assuming the exception-entry and
>> exception-return are the dominant cost. The cost of stepping should be
>> trivial.
>>
>> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
>> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
>> more expensive.
>>
>> Was this tested bare-metal, or in a VM?
> 
> Hey Mark, I hope Liao will have a chance to reply, I don't know the
> details of his benchmarking. But I can try to give you my numbers and
> maybe answer a few questions, hopefully that helps move the
> conversation forward.
> 
> So, first of all, I did a quick benchmark on bare metal (without
> Liao's optimization, though), here are my results:
> 
> uprobe-nop            ( 1 cpus):    2.334 ± 0.011M/s  (  2.334M/s/cpu)
> uprobe-push           ( 1 cpus):    2.321 ± 0.010M/s  (  2.321M/s/cpu)
> uprobe-ret            ( 1 cpus):    4.144 ± 0.041M/s  (  4.144M/s/cpu)
> 
> uretprobe-nop         ( 1 cpus):    1.684 ± 0.004M/s  (  1.684M/s/cpu)
> uretprobe-push        ( 1 cpus):    1.736 ± 0.003M/s  (  1.736M/s/cpu)
> uretprobe-ret         ( 1 cpus):    2.502 ± 0.006M/s  (  2.502M/s/cpu)
> 
> uretprobes are inherently slower, so I'll just compare uprobe, as the
> differences are very clear either way.
> 
> -nop is literally nop (Liao solved that issue, I just don't have his
> patch applied on my test machine). -push has `stp     x29, x30, [sp,
> #-0x10]!` instruction traced. -ret is literally just `ret`
> instruction.
> 
> So you can see that -ret is almost twice as fast as the -push variant
> (it's a microbenchmark, yes, but still).
> 
>>
>> AFAICT either:
>>
>> * Single-stepping is unexpectedly expensive.
>>
>>   Historically we had performance issues with hypervisor trapping of
>>   debug features, and there are things we might be able to improve in
>>   the hypervisor and kernel, which would improve stepping *all*
>>   instructions.
>>
> 
> Single-stepping will always be more expensive, as it necessitates
> extra hop kernel->user space->kernel, so no matter the optimization
> for single-stepping, if we can avoid it, we should. It will be
> noticeable.
> 
>>   If stepping is the big problem, we could move uprobes over to a BRK
>>   rather than a single-step. That would require require updating and
>>   fixing the logic to decide which instructions are steppable, but
>>   that's necessary anyway given it has extant soundness issues.
> 
> I'm afraid I don't understand what BRK means and what are the
> consequences in terms of overheads. I'm not an ARM person either, so
> sorry if that's a stupid question. But either way, I can't address
> this. But see above, emulating an instruction feels like a much better
> approach, if possible.

As I understand it, Mark's suggestion is to place a BRK instruction next
to the instruction in the xol slot. Once the instruction in the xol slot
has executed, the BRK would trigger a trap back into the kernel. This is
a common technique on platforms that don't support hardware single-
stepping. However, since Arm64 does support hardware single-stepping,
the kernel enables it in pre_ssout(), allowing the CPU to automatically
trap back into the kernel after the instruction in the xol slot has
executed. But even if we move uprobes over to a BRK rather than a
single-step, it can't reduce the overhead of the user->kernel->user
context switch on bare metal. Maybe I am wrong; Mark, could you give
more details about the BRK approach?
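
If I understand the proposal correctly, the xol slot would then hold the
original instruction followed by a BRK, instead of arming the hardware
single-step state. A sketch of what the slot preparation could look like
(the BRK immediate and helper name are only assumptions):

	#define UPROBE_XOL_BRK	(0xd4200000 | (0x100 << 5))	/* BRK #0x100, assumed */

	static void prepare_xol_slot_sketch(u32 *slot, u32 orig_insn)
	{
		slot[0] = orig_insn;		/* instruction stepped out of line */
		slot[1] = UPROBE_XOL_BRK;	/* trap back into the kernel */
		/* I/D caches still need syncing, as arch_uprobe_copy_ixol() does */
	}

Either way it still costs one exception entry and one exception return
per stepped instruction, which is why simulating the hot instructions
still looks preferable to me.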

> 
>>
>> * XOL management is absurdly expensive.
>>
>>   Does uprobes keep the XOL slot around (like krpobes does), or does it
>>   create the slot afresh for each trap?
> 
> XOL *page* is created once per process, lazily, and then we just
> juggle a bunch of fixed slots there for each instance of
> single-stepped uprobe. And yes, there are some bottlenecks in XOL
> management, though it's mostly due to lock contention (as it is
> implemented right now). Liao and Oleg have been improving XOL
> management, but still, avoiding XOL in the first place is the much
> preferred way.
> 
>>
>>   If that's trying to create a slot afresh for each trap, there are
>>   several opportunities for improvement, e.g. keep the slot around for
>>   as long as the uprobe exists, or pre-allocate shared slots for common
>>   instructions and use those.
> 
> As I mentioned, a XOL page is allocated and mapped once, but yes, it
> seems like we dynamically get a slot in it for each single-stepped
> execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
> probably not a bad idea to just cache and hold a XOL slot for each
> specific uprobe, I don't see why we should limit ourselves to just one
> XOL page. We also don't need to pre-size each slot, we can probably
> allocate just the right amount of space for a given uprobe.
> 
> All good ideas for sure, we should do them, IMO. But we'll still be
> paying an extra kernel->user->kernel switch, which almost certainly is
> slower than doing a simple stack push emulation just like we do in
> x86-64 case, no?
> 
> 
> BTW, I did a quick local profiling run. I don't think XOL management
> is the main source of overhead. I see 5% of CPU cycles spent in
> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
> traces. There are at least 22% CPU cycles spent in some
> local_daif_restore function, though, not sure what that is, but might
> be related to interrupt handling, right?

The local_daif_restore() is part of the path for all user->kernel->user
context switches, including interrupt handling, breakpoints,
single-stepping, etc. I am surprised to see it consuming 22% of CPU
cycles as well. I haven't been able to reproduce this on my local
machine.

Andrii, could you apply the patch below to see if it reduces the 5% of
CPU cycles spent in arch_uprobe_copy_ixol? I suspect that D/I cache
synchronization is the cause of this part of the overhead.

https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/

> 
> 
> The take away I'd like to communicate here is avoiding the
> single-stepping need is *the best way* to go, IMO. So if we can
> emulate those STP instructions for uprobe *cheaply*, that would be
> awesome.

Given that some significant uprobe optimizations from Oleg and Andrii
have been merged, I am curious to see how these changes impact the
profiling results on Arm64. So I re-ran the selftest bench on the latest
kernel (based on tag next-20241104) and on the kernel (based on tag
next-20240909) that I used when I submitted this patch. The re-run
results are shown below.
next-20240909(xol stp + xol nop)
--------------------------------
uprobe-nop      ( 1 cpus):    0.424 ± 0.000M/s  (  0.424M/s/cpu)
uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
uprobe-ret      ( 1 cpus):    2.101 ± 0.002M/s  (  2.101M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.347 ± 0.000M/s  (  0.347M/s/cpu)
uretprobe-push  ( 1 cpus):    0.349 ± 0.000M/s  (  0.349M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.051 ± 0.001M/s  (  1.051M/s/cpu)

next-20240909(sim stp + sim nop)
--------------------------------
uprobe-nop      ( 1 cpus):    2.042 ± 0.002M/s  (  2.042M/s/cpu)
uprobe-push     ( 1 cpus):    1.363 ± 0.002M/s  (  1.363M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    1.049 ± 0.001M/s  (  1.049M/s/cpu)
uretprobe-push  ( 1 cpus):    0.780 ± 0.000M/s  (  0.780M/s/cpu)
uretprobe-ret   ( 1 cpus):    1.065 ± 0.001M/s  (  1.065M/s/cpu)

next-20241104 (xol stp + sim nop)
---------------------------------
uprobe-nop      ( 1 cpus):    2.044 ± 0.003M/s  (  2.044M/s/cpu)
uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
uprobe-ret      ( 1 cpus):    2.047 ± 0.001M/s  (  2.047M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.832 ± 0.003M/s  (  0.832M/s/cpu)
uretprobe-push  ( 1 cpus):    0.328 ± 0.000M/s  (  0.328M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.833 ± 0.003M/s  (  0.833M/s/cpu)

next-20241104 (sim stp + sim nop)
---------------------------------
uprobe-nop      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
uprobe-push     ( 1 cpus):    1.411 ± 0.002M/s  (  1.411M/s/cpu)
uprobe-ret      ( 1 cpus):    2.052 ± 0.005M/s  (  2.052M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.839 ± 0.005M/s  (  0.839M/s/cpu)
uretprobe-push  ( 1 cpus):    0.702 ± 0.002M/s  (  0.702M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.837 ± 0.001M/s  (  0.837M/s/cpu)

It seems that the STP simulation approach in this patch significantly
improves uprobe-push throughput by 240% (from 0.415M/s to 1.411M/s)
and uretprobe-push throughput by 114% (from 0.328M/s to 0.702M/s) on
kernels based on next-20240909 and next-20241104. While there is still
room for improvement to reach the throughput of -nop and -ret, the
gains are very substantial.
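
For clarity, those percentages are relative increases over the
single-stepped numbers:

  uprobe-push:    (1.411 - 0.415) / 0.415 ≈ 2.40, i.e. ~240% higher
  uretprobe-push: (0.702 - 0.328) / 0.328 ≈ 1.14, i.e. ~114% higher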

But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using
single-stepped STP, which is far lower than the results from when I
submitted this patch (compare the uprobe-push and uretprobe-push numbers
in the commit log). I'm certain that the tests were run on the same
bare-metal machine with background tasks minimized. I suspect some
uncommitted uprobe optimization in my local repo skewed the earlier
-push results using single-step.

In addition to the micro benchmark, I also re-ran the Redis benchmark to
compare the impact of single-stepped STP and simulated STP on the
throughput of redis-server. I believe the impact of uprobes on a real
application depends on how frequently the uprobes are triggered and on
the application's hot paths, so I wouldn't say that simulated STP will
benefit all real-world applications.

$ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET
$ redis-server --port 7778 --protected-mode no --save "" --appendonly no & &&
  bpftrace -e 'uprobe:redis-server:readQueryFromClient{}
               uprobe:redis-server:processCommand{}
               uprobe:redis-server:aeApiPoll {}'

next-20241104
-------------
RPS: 55602.1

next-20241104 + ss stp
----------------------
RPS: 47220.9
uprobe@aeApiPoll: 554565
uprobe@processCommand: 1275160
uprobe@readQueryFromClient: 1277710

next-20241104 + sim stp
-----------------------
RPS: 54290.09
uprobe@aeApiPoll: 496007
uprobe@processCommand: 1275160
uprobe@readQueryFromClient: 1277710

Andrii expressed concern that the STP simulation in this patch is too
expensive. If we trust the re-run results, perhaps it is not a bad way
to simulate STP. Looking forward to your feedback; if someone can
propose a cheaper way to simulate STP, I'm very happy to test it on my
machine, thanks.
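
For anyone who hasn't read the patch, the core of this approach is
roughly the following (an illustrative sketch only, helper name made up
and error handling trimmed): pin the user stack page, map it into the
kernel, and perform a real STP through that mapping, so the store keeps
STP's single-copy atomicity and ordering properties.

static bool push_fp_lr_via_kernel_mapping(struct pt_regs *regs, u64 new_sp)
{
	struct page *page;
	void *kaddr, *dst;

	if (get_user_pages_fast(new_sp, 1, FOLL_WRITE, &page) != 1)
		return false;		/* fall back to single-stepping */

	kaddr = kmap_local_page(page);
	/* sp is 16-byte aligned, so the store cannot cross a page boundary */
	dst = kaddr + offset_in_page(new_sp);

	/* real STP, so atomicity/ordering match the original instruction */
	asm volatile("stp %1, %2, [%0]"
		     : : "r"(dst), "r"(regs->regs[29]), "r"(regs->regs[30])
		     : "memory");

	kunmap_local(kaddr);
	put_page(page);
	return true;
}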

[...]

>>>
>>> xol-stp
>>> -------
>>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
>>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
>>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
>>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
>>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
>>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>>>
>>> simulated-stp
>>> -------------
>>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
>>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
>>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
>>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
>>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
>>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>>>
>>> The profiling results based on the upstream kernel with spinlock
>>> optimization patches [2] reveals the simulation of STP increase the
>>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
>>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>>>
>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
>>>
>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>> ---
>>>  arch/arm64/include/asm/insn.h            |  1 +
>>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>>>  arch/arm64/lib/insn.c                    |  5 ++
>>>  7 files changed, 134 insertions(+)
>>>
> 
> [...]

-- 
BR
Liao, Chang

Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Andrii Nakryiko 3 weeks, 2 days ago
On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
> Andrii and Mark.
>
> 在 2024/10/26 4:51, Andrii Nakryiko 写道:
> > On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >>
> >> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> >>> This patch is the second part of a series to improve the selftest bench
> >>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> >>> significantly impact uprobe/uretprobe performance at function entry in
> >>> most user cases. Profiling results below reveals the STP that executes
> >>> in the xol slot and trap back to kernel, reduce redis RPS and increase
> >>> the time of string grep obviously.
> >>>
> >>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>>
> >>> Redis GET (higher is better)
> >>> ----------------------------
> >>> No uprobe: 49149.71 RPS
> >>> Single-stepped STP: 46750.82 RPS
> >>> Emulated STP: 48981.19 RPS
> >>>
> >>> Redis SET (larger is better)
> >>> ----------------------------
> >>> No uprobe: 49761.14 RPS
> >>> Single-stepped STP: 45255.01 RPS
> >>> Emulated stp: 48619.21 RPS
> >>>
> >>> Grep (lower is better)
> >>> ----------------------
> >>> No uprobe: 2.165s
> >>> Single-stepped STP: 15.314s
> >>> Emualted STP: 2.216s
> >>
> >> The results for grep are concerning.
> >>
> >> In theory, the overhead for stepping should be roughly double the
> >> overhead for emulating, assuming the exception-entry and
> >> exception-return are the dominant cost. The cost of stepping should be
> >> trivial.
> >>
> >> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
> >> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
> >> more expensive.
> >>
> >> Was this tested bare-metal, or in a VM?
> >
> > Hey Mark, I hope Liao will have a chance to reply, I don't know the
> > details of his benchmarking. But I can try to give you my numbers and
> > maybe answer a few questions, hopefully that helps move the
> > conversation forward.
> >
> > So, first of all, I did a quick benchmark on bare metal (without
> > Liao's optimization, though), here are my results:
> >
> > uprobe-nop            ( 1 cpus):    2.334 ± 0.011M/s  (  2.334M/s/cpu)
> > uprobe-push           ( 1 cpus):    2.321 ± 0.010M/s  (  2.321M/s/cpu)
> > uprobe-ret            ( 1 cpus):    4.144 ± 0.041M/s  (  4.144M/s/cpu)
> >
> > uretprobe-nop         ( 1 cpus):    1.684 ± 0.004M/s  (  1.684M/s/cpu)
> > uretprobe-push        ( 1 cpus):    1.736 ± 0.003M/s  (  1.736M/s/cpu)
> > uretprobe-ret         ( 1 cpus):    2.502 ± 0.006M/s  (  2.502M/s/cpu)
> >
> > uretprobes are inherently slower, so I'll just compare uprobe, as the
> > differences are very clear either way.
> >
> > -nop is literally nop (Liao solved that issue, I just don't have his
> > patch applied on my test machine). -push has `stp     x29, x30, [sp,
> > #-0x10]!` instruction traced. -ret is literally just `ret`
> > instruction.
> >
> > So you can see that -ret is almost twice as fast as the -push variant
> > (it's a microbenchmark, yes, but still).
> >
> >>
> >> AFAICT either:
> >>
> >> * Single-stepping is unexpectedly expensive.
> >>
> >>   Historically we had performance issues with hypervisor trapping of
> >>   debug features, and there are things we might be able to improve in
> >>   the hypervisor and kernel, which would improve stepping *all*
> >>   instructions.
> >>
> >
> > Single-stepping will always be more expensive, as it necessitates
> > extra hop kernel->user space->kernel, so no matter the optimization
> > for single-stepping, if we can avoid it, we should. It will be
> > noticeable.
> >
> >>   If stepping is the big problem, we could move uprobes over to a BRK
> >>   rather than a single-step. That would require require updating and
> >>   fixing the logic to decide which instructions are steppable, but
> >>   that's necessary anyway given it has extant soundness issues.
> >
> > I'm afraid I don't understand what BRK means and what are the
> > consequences in terms of overheads. I'm not an ARM person either, so
> > sorry if that's a stupid question. But either way, I can't address
> > this. But see above, emulating an instruction feels like a much better
> > approach, if possible.
>
> As I understand, Mark's suggestion is to place a BRK instruction next to
> the instruction in the xol slot. Once the instruction in the xol slot
> executed, the BRK instruction would trigger a trap into kernel. This is
> a common technique used on platforms that don't support hardware single-
> step. However, since Arm64 does support hardware single-stepping, kernel
> enables it in pre_ssout(), allowing the CPU to automatically trap into kernel
> after instruction in xol slot executed. But even we move uprobes over
> to a BRK rather than a single-step. It can't reduce the overhead of user->
> kernel->user context switch on the bare-metal. Maybe I am wrong, Mark,
> could you give more details about the BRK.
>

I see, thanks for elaborating. So the suggestion was to go from very
expensive single-stepping mode to still expensive breakpoint-based
kernel->user->kernel workflow.

I think either way it's going to be much slower than avoiding
kernel->user->kernel hop, so we should emulate STP instead, yep.

> >
> >>
> >> * XOL management is absurdly expensive.
> >>
> >>   Does uprobes keep the XOL slot around (like krpobes does), or does it
> >>   create the slot afresh for each trap?
> >
> > XOL *page* is created once per process, lazily, and then we just
> > juggle a bunch of fixed slots there for each instance of
> > single-stepped uprobe. And yes, there are some bottlenecks in XOL
> > management, though it's mostly due to lock contention (as it is
> > implemented right now). Liao and Oleg have been improving XOL
> > management, but still, avoiding XOL in the first place is the much
> > preferred way.
> >
> >>
> >>   If that's trying to create a slot afresh for each trap, there are
> >>   several opportunities for improvement, e.g. keep the slot around for
> >>   as long as the uprobe exists, or pre-allocate shared slots for common
> >>   instructions and use those.
> >
> > As I mentioned, a XOL page is allocated and mapped once, but yes, it
> > seems like we dynamically get a slot in it for each single-stepped
> > execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
> > probably not a bad idea to just cache and hold a XOL slot for each
> > specific uprobe, I don't see why we should limit ourselves to just one
> > XOL page. We also don't need to pre-size each slot, we can probably
> > allocate just the right amount of space for a given uprobe.
> >
> > All good ideas for sure, we should do them, IMO. But we'll still be
> > paying an extra kernel->user->kernel switch, which almost certainly is
> > slower than doing a simple stack push emulation just like we do in
> > x86-64 case, no?
> >
> >
> > BTW, I did a quick local profiling run. I don't think XOL management
> > is the main source of overhead. I see 5% of CPU cycles spent in
> > arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
> > traces. There are at least 22% CPU cycles spent in some
> > local_daif_restore function, though, not sure what that is, but might
> > be related to interrupt handling, right?
>
> The local_daif_restore() is part of the path for all user->kernel->user
> context switch, including interrupt handling, breakpoints, and single-stepping
> etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't
> been enable to reproduce this on my local machine.
>
> Andrii, could you use the patch below to see if it can reduce the 5% of
> CPU cycles spent in arch_uprobe_copy_ixol, I doubt that D/I cache
> synchronization is the cause of this part of overhead.
>
> https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/

tbh, I think pre-allocating and setting up fixed XOL slots instead of
dynamically allocating them is the way to go. We can allocate as many
special "[uprobes]" pages as necessary to accommodate all the
single-stepped uprobes in XOL, remember their index, etc. I think
that's a much more performant (and simpler, IMO) approach overall. All
this preparation, however expensive it might be, will be done once per
attached/detached uprobe/uretprobe, which is the place where we want
to do expensive stuff, not when the uprobe/uretprobe is actually
triggered.
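
Something along these lines, just to sketch the idea (not an existing
API, names made up): reserve a slot per uprobe at attach time, and make
the hot path a plain lookup with no locking or bitmap search.

struct uprobe_xol_slot {
	unsigned long vaddr;	/* user address of the reserved slot */
	unsigned int  len;	/* sized exactly for this uprobe's insns */
};

/* hot path: just hand back the slot reserved at attach time */
static inline unsigned long xol_cached_slot(const struct uprobe_xol_slot *s)
{
	return s->vaddr;
}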

>
> >
> >
> > The take away I'd like to communicate here is avoiding the
> > single-stepping need is *the best way* to go, IMO. So if we can
> > emulate those STP instructions for uprobe *cheaply*, that would be
> > awesome.
>
> Given some significant uprobe optimizations from Oleg and Andrii
> merged, I am curious to see how these changes impact the profiling
> result on Arm64. So I re-ran the selftest bench on the latest kernel
> (based on tag next-20241104) and the kernel (based on tag next-20240909)
> that I used when I submitted this patch. The results re-ran are shown
> below.
>
> next-20240909(xol stp + xol nop)
> --------------------------------
> uprobe-nop      ( 1 cpus):    0.424 ± 0.000M/s  (  0.424M/s/cpu)
> uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
> uprobe-ret      ( 1 cpus):    2.101 ± 0.002M/s  (  2.101M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.347 ± 0.000M/s  (  0.347M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.349 ± 0.000M/s  (  0.349M/s/cpu)
> uretprobe-ret   ( 1 cpus):    1.051 ± 0.001M/s  (  1.051M/s/cpu)
>
> next-20240909(sim stp + sim nop)
> --------------------------------
> uprobe-nop      ( 1 cpus):    2.042 ± 0.002M/s  (  2.042M/s/cpu)
> uprobe-push     ( 1 cpus):    1.363 ± 0.002M/s  (  1.363M/s/cpu)
> uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
> uretprobe-nop   ( 1 cpus):    1.049 ± 0.001M/s  (  1.049M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.780 ± 0.000M/s  (  0.780M/s/cpu)
> uretprobe-ret   ( 1 cpus):    1.065 ± 0.001M/s  (  1.065M/s/cpu)
>
> next-20241104 (xol stp + sim nop)
> ---------------------------------
> uprobe-nop      ( 1 cpus):    2.044 ± 0.003M/s  (  2.044M/s/cpu)
> uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
> uprobe-ret      ( 1 cpus):    2.047 ± 0.001M/s  (  2.047M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.832 ± 0.003M/s  (  0.832M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.328 ± 0.000M/s  (  0.328M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.833 ± 0.003M/s  (  0.833M/s/cpu)
>
> next-20241104 (sim stp + sim nop)
> ---------------------------------
> uprobe-nop      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
> uprobe-push     ( 1 cpus):    1.411 ± 0.002M/s  (  1.411M/s/cpu)
> uprobe-ret      ( 1 cpus):    2.052 ± 0.005M/s  (  2.052M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.839 ± 0.005M/s  (  0.839M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.702 ± 0.002M/s  (  0.702M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.837 ± 0.001M/s  (  0.837M/s/cpu)
>
> It seems that the STP simluation approach in this patch significantly
> improves uprobe-push throughtput by 240% (from 0.415Ms/ to 1.411M/s)
> and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels
> bases on next-20240909 and next-20241104. While there is still room
> for improvement to reach the throughput of -nop and -ret, the gains
> are very substantail.
>
> But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using
> single-stepping stp, which are far lower compared to the result when
> when I submitted patch(look closely to the uprobe-push and uretprobe-push
> results in commit log). I'm certain that the tests were run on the
> same bare-metal machine with background tasked minimized. I doubt some
> uncommitted uprobe optimization on my local repo twist the result of
> -push using single-step.

You can always profile and compare before/after, right? See where the
added costs are coming from?

>
> In addition to the micro benchmark, I also re-ran Redis benchmark to
> compare the impact of single-stepping STP and simluated STP to the
> throughput of redis-server. I believe the impact of uprobe on the real
> application depends on the frequency of uprobe triggered and the application's
> hot paths. Therefore, I wouldn't say the simluated STP will benefit all
> real world applications.

It will benefit *a lot* of real world applications, though, so I think
it's very important to improve.

>
> $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET
> $ redis-server --port 7778 --protected-mode no --save "" --appendonly no & &&
>   bpftrace -e 'uprobe:redis-server:readQueryFromClient{}
>                uprobe:redis-server:processCommand{}
>                uprobe:redis-server:aeApiPoll {}'
>
> next-20241104
> -------------
> RPS: 55602.1
>
> next-20241104 + ss stp
> ----------------------
> RPS: 47220.9
> uprobe@@aeApiPoll: 554565
> uprobe@processCommand: 1275160
> uprobe@readQueryFromClient: 1277710
>
> next-20241104 + sim stp
> -----------------------
> RPS           54290.09
> uprobe@aeApiPoll: 496007
> uprobe@processCommand: 1275160
> uprobe@readQueryFromClient: 1277710
>
> Andrii expressed concern that the STP simulation in this patch is too
> expensive. If we believe the result I re-ran, perhaps it is not a
> bad way to simluate STP. Looking forward to your feedbacks, or someone
> could propose a cheaper way to simluate STP, I'm very happy to test it
> on my machine, thanks.

I'm no ARM64 expert, but seeing that we simulate stack pushes with
plain memory reads/writes for x86-64, it feels like that should be
satisfactory for ARM64 as well. So I'd suggest you go back to the
initial implementation, clean it up, rebase, re-benchmark, and send a
new revision. Let's continue the discussion there?

>
> [...]
>
> >>>
> >>> xol-stp
> >>> -------
> >>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> >>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> >>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> >>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> >>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> >>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
> >>>
> >>> simulated-stp
> >>> -------------
> >>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> >>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> >>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> >>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> >>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> >>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
> >>>
> >>> The profiling results based on the upstream kernel with spinlock
> >>> optimization patches [2] reveals the simulation of STP increase the
> >>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
> >>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> >>>
> >>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> >>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> >>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
> >>>
> >>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>> ---
> >>>  arch/arm64/include/asm/insn.h            |  1 +
> >>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
> >>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
> >>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
> >>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
> >>>  arch/arm64/lib/insn.c                    |  5 ++
> >>>  7 files changed, 134 insertions(+)
> >>>
> >
> > [...]
>
> --
> BR
> Liao, Chang
>
Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Liao, Chang 3 weeks, 1 day ago

在 2024/11/7 3:45, Andrii Nakryiko 写道:
> On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>> Andrii and Mark.
>>
>> 在 2024/10/26 4:51, Andrii Nakryiko 写道:
>>> On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
>>>>
>>>> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
>>>>> This patch is the second part of a series to improve the selftest bench
>>>>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
>>>>> significantly impact uprobe/uretprobe performance at function entry in
>>>>> most user cases. Profiling results below reveals the STP that executes
>>>>> in the xol slot and trap back to kernel, reduce redis RPS and increase
>>>>> the time of string grep obviously.
>>>>>
>>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>>>
>>>>> Redis GET (higher is better)
>>>>> ----------------------------
>>>>> No uprobe: 49149.71 RPS
>>>>> Single-stepped STP: 46750.82 RPS
>>>>> Emulated STP: 48981.19 RPS
>>>>>
>>>>> Redis SET (larger is better)
>>>>> ----------------------------
>>>>> No uprobe: 49761.14 RPS
>>>>> Single-stepped STP: 45255.01 RPS
>>>>> Emulated stp: 48619.21 RPS
>>>>>
>>>>> Grep (lower is better)
>>>>> ----------------------
>>>>> No uprobe: 2.165s
>>>>> Single-stepped STP: 15.314s
>>>>> Emualted STP: 2.216s
>>>>
>>>> The results for grep are concerning.
>>>>
>>>> In theory, the overhead for stepping should be roughly double the
>>>> overhead for emulating, assuming the exception-entry and
>>>> exception-return are the dominant cost. The cost of stepping should be
>>>> trivial.
>>>>
>>>> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
>>>> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
>>>> more expensive.
>>>>
>>>> Was this tested bare-metal, or in a VM?
>>>
>>> Hey Mark, I hope Liao will have a chance to reply, I don't know the
>>> details of his benchmarking. But I can try to give you my numbers and
>>> maybe answer a few questions, hopefully that helps move the
>>> conversation forward.
>>>
>>> So, first of all, I did a quick benchmark on bare metal (without
>>> Liao's optimization, though), here are my results:
>>>
>>> uprobe-nop            ( 1 cpus):    2.334 ± 0.011M/s  (  2.334M/s/cpu)
>>> uprobe-push           ( 1 cpus):    2.321 ± 0.010M/s  (  2.321M/s/cpu)
>>> uprobe-ret            ( 1 cpus):    4.144 ± 0.041M/s  (  4.144M/s/cpu)
>>>
>>> uretprobe-nop         ( 1 cpus):    1.684 ± 0.004M/s  (  1.684M/s/cpu)
>>> uretprobe-push        ( 1 cpus):    1.736 ± 0.003M/s  (  1.736M/s/cpu)
>>> uretprobe-ret         ( 1 cpus):    2.502 ± 0.006M/s  (  2.502M/s/cpu)
>>>
>>> uretprobes are inherently slower, so I'll just compare uprobe, as the
>>> differences are very clear either way.
>>>
>>> -nop is literally nop (Liao solved that issue, I just don't have his
>>> patch applied on my test machine). -push has `stp     x29, x30, [sp,
>>> #-0x10]!` instruction traced. -ret is literally just `ret`
>>> instruction.
>>>
>>> So you can see that -ret is almost twice as fast as the -push variant
>>> (it's a microbenchmark, yes, but still).
>>>
>>>>
>>>> AFAICT either:
>>>>
>>>> * Single-stepping is unexpectedly expensive.
>>>>
>>>>   Historically we had performance issues with hypervisor trapping of
>>>>   debug features, and there are things we might be able to improve in
>>>>   the hypervisor and kernel, which would improve stepping *all*
>>>>   instructions.
>>>>
>>>
>>> Single-stepping will always be more expensive, as it necessitates
>>> extra hop kernel->user space->kernel, so no matter the optimization
>>> for single-stepping, if we can avoid it, we should. It will be
>>> noticeable.
>>>
>>>>   If stepping is the big problem, we could move uprobes over to a BRK
>>>>   rather than a single-step. That would require require updating and
>>>>   fixing the logic to decide which instructions are steppable, but
>>>>   that's necessary anyway given it has extant soundness issues.
>>>
>>> I'm afraid I don't understand what BRK means and what are the
>>> consequences in terms of overheads. I'm not an ARM person either, so
>>> sorry if that's a stupid question. But either way, I can't address
>>> this. But see above, emulating an instruction feels like a much better
>>> approach, if possible.
>>
>> As I understand, Mark's suggestion is to place a BRK instruction next to
>> the instruction in the xol slot. Once the instruction in the xol slot
>> executed, the BRK instruction would trigger a trap into kernel. This is
>> a common technique used on platforms that don't support hardware single-
>> step. However, since Arm64 does support hardware single-stepping, kernel
>> enables it in pre_ssout(), allowing the CPU to automatically trap into kernel
>> after instruction in xol slot executed. But even we move uprobes over
>> to a BRK rather than a single-step. It can't reduce the overhead of user->
>> kernel->user context switch on the bare-metal. Maybe I am wrong, Mark,
>> could you give more details about the BRK.
>>
> 
> I see, thanks for elaborating. So the suggestion was to go from very
> expensive single-stepping mode to still expensive breakpoint-based
> kernel->user->kernel workflow.
> 
> I think either way it's going to be much slower than avoiding
> kernel->user->kernel hop, so we should emulate STP instead, yep.

Exactly, in most cases simulation is the better option.

> 
>>>
>>>>
>>>> * XOL management is absurdly expensive.
>>>>
>>>>   Does uprobes keep the XOL slot around (like krpobes does), or does it
>>>>   create the slot afresh for each trap?
>>>
>>> XOL *page* is created once per process, lazily, and then we just
>>> juggle a bunch of fixed slots there for each instance of
>>> single-stepped uprobe. And yes, there are some bottlenecks in XOL
>>> management, though it's mostly due to lock contention (as it is
>>> implemented right now). Liao and Oleg have been improving XOL
>>> management, but still, avoiding XOL in the first place is the much
>>> preferred way.
>>>
>>>>
>>>>   If that's trying to create a slot afresh for each trap, there are
>>>>   several opportunities for improvement, e.g. keep the slot around for
>>>>   as long as the uprobe exists, or pre-allocate shared slots for common
>>>>   instructions and use those.
>>>
>>> As I mentioned, a XOL page is allocated and mapped once, but yes, it
>>> seems like we dynamically get a slot in it for each single-stepped
>>> execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
>>> probably not a bad idea to just cache and hold a XOL slot for each
>>> specific uprobe, I don't see why we should limit ourselves to just one
>>> XOL page. We also don't need to pre-size each slot, we can probably
>>> allocate just the right amount of space for a given uprobe.
>>>
>>> All good ideas for sure, we should do them, IMO. But we'll still be
>>> paying an extra kernel->user->kernel switch, which almost certainly is
>>> slower than doing a simple stack push emulation just like we do in
>>> x86-64 case, no?
>>>
>>>
>>> BTW, I did a quick local profiling run. I don't think XOL management
>>> is the main source of overhead. I see 5% of CPU cycles spent in
>>> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
>>> traces. There are at least 22% CPU cycles spent in some
>>> local_daif_restore function, though, not sure what that is, but might
>>> be related to interrupt handling, right?
>>
>> The local_daif_restore() is part of the path for all user->kernel->user
>> context switch, including interrupt handling, breakpoints, and single-stepping
>> etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't
>> been enable to reproduce this on my local machine.
>>
>> Andrii, could you use the patch below to see if it can reduce the 5% of
>> CPU cycles spent in arch_uprobe_copy_ixol, I doubt that D/I cache
>> synchronization is the cause of this part of overhead.
>>
>> https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/
> 
> tbh, I think pre-allocating and setting up fixed XOL slots instead of
> dynamically allocating them is the way to go. We can allocate as many
> special "[uprobes]" pages as necessary to accommodate all the
> single-stepped uprobes in XOL, remember their index, etc. I think
> that's much more performant (and simpler, IMO) approach overall. All
> this preparation, however expensive it might be, will be done once per
> each attached/detached uprobe/uretprobe, which is the place where we
> want to do expensive stuff. Not when uprobe/uretprobe is actually
> triggered.

Generally agreed. But I have two concerns about pre-allocating XOL slots:

1. If some uprobes/uretprobes are rarely or never triggered, pre-allocating
   slots for them seems wasteful. However, that isn't an issue on machines
   with ample memory (e.g., hundreds of GB or a couple of TB).

2. Currently, threads that trigger the same uprobe/uretprobe are dynamically
   allocated different slots. Since we can't predict how many threads will
   trigger the same uprobe/uretprobe, we can't pre-allocate enough slots for
   each uprobe/uretprobe. If all threads were allowed to share the slot
   associated with each uprobe/uretprobe, XOL pages would end up being
   shared across processes, which I think would require significant changes
   to the page management of xol_area and to fault handling.

> 
>>
>>>
>>>
>>> The take away I'd like to communicate here is avoiding the
>>> single-stepping need is *the best way* to go, IMO. So if we can
>>> emulate those STP instructions for uprobe *cheaply*, that would be
>>> awesome.
>>
>> Given some significant uprobe optimizations from Oleg and Andrii
>> merged, I am curious to see how these changes impact the profiling
>> result on Arm64. So I re-ran the selftest bench on the latest kernel
>> (based on tag next-20241104) and the kernel (based on tag next-20240909)
>> that I used when I submitted this patch. The results re-ran are shown
>> below.
>>
>> next-20240909(xol stp + xol nop)
>> --------------------------------
>> uprobe-nop      ( 1 cpus):    0.424 ± 0.000M/s  (  0.424M/s/cpu)
>> uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
>> uprobe-ret      ( 1 cpus):    2.101 ± 0.002M/s  (  2.101M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.347 ± 0.000M/s  (  0.347M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.349 ± 0.000M/s  (  0.349M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    1.051 ± 0.001M/s  (  1.051M/s/cpu)
>>
>> next-20240909(sim stp + sim nop)
>> --------------------------------
>> uprobe-nop      ( 1 cpus):    2.042 ± 0.002M/s  (  2.042M/s/cpu)
>> uprobe-push     ( 1 cpus):    1.363 ± 0.002M/s  (  1.363M/s/cpu)
>> uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    1.049 ± 0.001M/s  (  1.049M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.780 ± 0.000M/s  (  0.780M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    1.065 ± 0.001M/s  (  1.065M/s/cpu)
>>
>> next-20241104 (xol stp + sim nop)
>> ---------------------------------
>> uprobe-nop      ( 1 cpus):    2.044 ± 0.003M/s  (  2.044M/s/cpu)
>> uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
>> uprobe-ret      ( 1 cpus):    2.047 ± 0.001M/s  (  2.047M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.832 ± 0.003M/s  (  0.832M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.328 ± 0.000M/s  (  0.328M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.833 ± 0.003M/s  (  0.833M/s/cpu)
>>
>> next-20241104 (sim stp + sim nop)
>> ---------------------------------
>> uprobe-nop      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
>> uprobe-push     ( 1 cpus):    1.411 ± 0.002M/s  (  1.411M/s/cpu)
>> uprobe-ret      ( 1 cpus):    2.052 ± 0.005M/s  (  2.052M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.839 ± 0.005M/s  (  0.839M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.702 ± 0.002M/s  (  0.702M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.837 ± 0.001M/s  (  0.837M/s/cpu)
>>
>> It seems that the STP simluation approach in this patch significantly
>> improves uprobe-push throughtput by 240% (from 0.415Ms/ to 1.411M/s)
>> and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels
>> bases on next-20240909 and next-20241104. While there is still room
>> for improvement to reach the throughput of -nop and -ret, the gains
>> are very substantail.
>>
>> But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using
>> single-stepping stp, which are far lower compared to the result when
>> when I submitted patch(look closely to the uprobe-push and uretprobe-push
>> results in commit log). I'm certain that the tests were run on the
>> same bare-metal machine with background tasked minimized. I doubt some
>> uncommitted uprobe optimization on my local repo twist the result of
>> -push using single-step.
> 
> You can always profiler and compare before/after, right? See where
> added costs are coming from?

Yes, I've been looking through the git reflog to restore the
kernel tree to the state it was in when I submitted this patch.

Regarding the throughput data of uprobe/uretprobe-push using
single-stepping, I believe *the re-run results provide a more
accurate picture*. If we carefully compare the throughput of
-nop and -push, we can see that they are very close (uprobe-nop
is 0.424M/s/cpu vs. uprobe-push at 0.415M/s/cpu, and uretprobe-nop
is 0.347M/s/cpu vs. uretprobe-push at 0.349M/s/cpu). This is
expected, as both use single-stepping to execute the NOP and STP.
There's no reason a single-stepped -push should outperform a
single-stepped -nop (uprobe-push was 0.868M/s/cpu a month ago).

In summary, understanding these performance characteristics is crucial
for selecting an approach that balances accuracy and efficiency.
Although this patch offers a lower gain than my initial implementation,
considering the complexity of 'STP', the cost seems acceptable,
right?

> 
>>
>> In addition to the micro benchmark, I also re-ran Redis benchmark to
>> compare the impact of single-stepping STP and simluated STP to the
>> throughput of redis-server. I believe the impact of uprobe on the real
>> application depends on the frequency of uprobe triggered and the application's
>> hot paths. Therefore, I wouldn't say the simluated STP will benefit all
>> real world applications.
> 
> It will benefit *a lot* of real world applications, though, so I think
> it's very important to improve.

Good to see that.

> 
>>
>> $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET
>> $ redis-server --port 7778 --protected-mode no --save "" --appendonly no & &&
>>   bpftrace -e 'uprobe:redis-server:readQueryFromClient{}
>>                uprobe:redis-server:processCommand{}
>>                uprobe:redis-server:aeApiPoll {}'
>>
>> next-20241104
>> -------------
>> RPS: 55602.1
>>
>> next-20241104 + ss stp
>> ----------------------
>> RPS: 47220.9
>> uprobe@@aeApiPoll: 554565
>> uprobe@processCommand: 1275160
>> uprobe@readQueryFromClient: 1277710
>>
>> next-20241104 + sim stp
>> -----------------------
>> RPS           54290.09
>> uprobe@aeApiPoll: 496007
>> uprobe@processCommand: 1275160
>> uprobe@readQueryFromClient: 1277710
>>
>> Andrii expressed concern that the STP simulation in this patch is too
>> expensive. If we believe the result I re-ran, perhaps it is not a
>> bad way to simluate STP. Looking forward to your feedbacks, or someone
>> could propose a cheaper way to simluate STP, I'm very happy to test it
>> on my machine, thanks.
> 
> I'm no ARM64 expert, but seeing that we simulate stack pushes with
> just memory reads/write for x86-64, it feels like that should be
> satisfactory for ARM64. So I'd suggest you to go back to the initial
> implementation, clean it up, rebase, re-benchmark, and send a new
> revision. Let's continue the discussion there?

I've re-benchmarked the initial implementation on the latest kernel
tree (tag next-20241104). I'm pasting the results for single-stepping,
the initial patch, and this patch here for comparison. Please see
the results below:

next-20241104 + xol stp
-----------------------
uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
uretprobe-push  ( 1 cpus):    0.328 ± 0.000M/s  (  0.328M/s/cpu)

next-20241104 + initial patch
-----------------------------
uprobe-push     ( 1 cpus):    1.798 ± 0.001M/s  (  1.798M/s/cpu)
uretprobe-push  ( 1 cpus):    0.806 ± 0.001M/s  (  0.806M/s/cpu)

next-20241104 + this patch
--------------------------
uprobe-push     ( 1 cpus):    1.411 ± 0.002M/s  (  1.411M/s/cpu)
uretprobe-push  ( 1 cpus):    0.702 ± 0.002M/s  (  0.702M/s/cpu)

As shown in the benchmark results, the initial implementation offers
only a limited performance advantage over this patch, and it comes at
the cost of accuracy.
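
For comparison, the initial implementation emulated the push with plain
uaccess stores, roughly along these lines (an illustrative sketch, not
the exact code from the first version). As Mark pointed out, this loses
STP's single-copy atomicity and ordering properties, which is what I
mean by the cost of accuracy:

static bool simulate_stp_fp_lr(struct pt_regs *regs, u32 insn)
{
	/* pre-indexed imm7, scaled by 8 for a 64-bit register pair */
	s64 offset = sign_extend64((insn >> 15) & 0x7f, 6) * 8;
	u64 new_sp = regs->sp + offset;
	u64 pair[2] = { regs->regs[29], regs->regs[30] };

	/* two plain stores to the user stack, not one atomic 16-byte STP */
	if (copy_to_user((void __user *)new_sp, pair, sizeof(pair)))
		return false;		/* let the fallback path handle it */

	regs->sp = new_sp;
	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
	return true;
}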

> 
>>
>> [...]
>>
>>>>>
>>>>> xol-stp
>>>>> -------
>>>>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
>>>>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
>>>>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
>>>>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
>>>>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
>>>>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>>>>>
>>>>> simulated-stp
>>>>> -------------
>>>>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
>>>>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
>>>>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
>>>>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
>>>>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
>>>>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>>>>>
>>>>> The profiling results based on the upstream kernel with spinlock
>>>>> optimization patches [2] reveals the simulation of STP increase the
>>>>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
>>>>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>>>>>
>>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>>>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>>>>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
>>>>>
>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>> ---
>>>>>  arch/arm64/include/asm/insn.h            |  1 +
>>>>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>>>>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>>>>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>>>>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>>>>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>>>>>  arch/arm64/lib/insn.c                    |  5 ++
>>>>>  7 files changed, 134 insertions(+)
>>>>>
>>>
>>> [...]
>>
>> --
>> BR
>> Liao, Chang
>>

-- 
BR
Liao, Chang

Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Andrii Nakryiko 3 weeks ago
On Thu, Nov 7, 2024 at 7:13 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/11/7 3:45, Andrii Nakryiko 写道:
> > On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >> Andrii and Mark.
> >>
> >> 在 2024/10/26 4:51, Andrii Nakryiko 写道:
> >>> On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >>>>
> >>>> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> >>>>> This patch is the second part of a series to improve the selftest bench
> >>>>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> >>>>> significantly impact uprobe/uretprobe performance at function entry in
> >>>>> most user cases. Profiling results below reveals the STP that executes
> >>>>> in the xol slot and trap back to kernel, reduce redis RPS and increase
> >>>>> the time of string grep obviously.
> >>>>>
> >>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>>>>
> >>>>> Redis GET (higher is better)
> >>>>> ----------------------------
> >>>>> No uprobe: 49149.71 RPS
> >>>>> Single-stepped STP: 46750.82 RPS
> >>>>> Emulated STP: 48981.19 RPS
> >>>>>
> >>>>> Redis SET (larger is better)
> >>>>> ----------------------------
> >>>>> No uprobe: 49761.14 RPS
> >>>>> Single-stepped STP: 45255.01 RPS
> >>>>> Emulated stp: 48619.21 RPS
> >>>>>
> >>>>> Grep (lower is better)
> >>>>> ----------------------
> >>>>> No uprobe: 2.165s
> >>>>> Single-stepped STP: 15.314s
> >>>>> Emualted STP: 2.216s
> >>>>
> >>>> The results for grep are concerning.
> >>>>
> >>>> In theory, the overhead for stepping should be roughly double the
> >>>> overhead for emulating, assuming the exception-entry and
> >>>> exception-return are the dominant cost. The cost of stepping should be
> >>>> trivial.
> >>>>
> >>>> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
> >>>> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
> >>>> more expensive.
> >>>>
> >>>> Was this tested bare-metal, or in a VM?
> >>>
> >>> Hey Mark, I hope Liao will have a chance to reply, I don't know the
> >>> details of his benchmarking. But I can try to give you my numbers and
> >>> maybe answer a few questions, hopefully that helps move the
> >>> conversation forward.
> >>>
> >>> So, first of all, I did a quick benchmark on bare metal (without
> >>> Liao's optimization, though), here are my results:
> >>>
> >>> uprobe-nop            ( 1 cpus):    2.334 ± 0.011M/s  (  2.334M/s/cpu)
> >>> uprobe-push           ( 1 cpus):    2.321 ± 0.010M/s  (  2.321M/s/cpu)
> >>> uprobe-ret            ( 1 cpus):    4.144 ± 0.041M/s  (  4.144M/s/cpu)
> >>>
> >>> uretprobe-nop         ( 1 cpus):    1.684 ± 0.004M/s  (  1.684M/s/cpu)
> >>> uretprobe-push        ( 1 cpus):    1.736 ± 0.003M/s  (  1.736M/s/cpu)
> >>> uretprobe-ret         ( 1 cpus):    2.502 ± 0.006M/s  (  2.502M/s/cpu)
> >>>
> >>> uretprobes are inherently slower, so I'll just compare uprobe, as the
> >>> differences are very clear either way.
> >>>
> >>> -nop is literally nop (Liao solved that issue, I just don't have his
> >>> patch applied on my test machine). -push has `stp     x29, x30, [sp,
> >>> #-0x10]!` instruction traced. -ret is literally just `ret`
> >>> instruction.
> >>>
> >>> So you can see that -ret is almost twice as fast as the -push variant
> >>> (it's a microbenchmark, yes, but still).
> >>>
> >>>>
> >>>> AFAICT either:
> >>>>
> >>>> * Single-stepping is unexpectedly expensive.
> >>>>
> >>>>   Historically we had performance issues with hypervisor trapping of
> >>>>   debug features, and there are things we might be able to improve in
> >>>>   the hypervisor and kernel, which would improve stepping *all*
> >>>>   instructions.
> >>>>
> >>>
> >>> Single-stepping will always be more expensive, as it necessitates
> >>> extra hop kernel->user space->kernel, so no matter the optimization
> >>> for single-stepping, if we can avoid it, we should. It will be
> >>> noticeable.
> >>>
> >>>>   If stepping is the big problem, we could move uprobes over to a BRK
> >>>>   rather than a single-step. That would require require updating and
> >>>>   fixing the logic to decide which instructions are steppable, but
> >>>>   that's necessary anyway given it has extant soundness issues.
> >>>
> >>> I'm afraid I don't understand what BRK means and what are the
> >>> consequences in terms of overheads. I'm not an ARM person either, so
> >>> sorry if that's a stupid question. But either way, I can't address
> >>> this. But see above, emulating an instruction feels like a much better
> >>> approach, if possible.
> >>
> >> As I understand, Mark's suggestion is to place a BRK instruction next to
> >> the instruction in the xol slot. Once the instruction in the xol slot
> >> executed, the BRK instruction would trigger a trap into kernel. This is
> >> a common technique used on platforms that don't support hardware single-
> >> step. However, since Arm64 does support hardware single-stepping, kernel
> >> enables it in pre_ssout(), allowing the CPU to automatically trap into kernel
> >> after instruction in xol slot executed. But even we move uprobes over
> >> to a BRK rather than a single-step. It can't reduce the overhead of user->
> >> kernel->user context switch on the bare-metal. Maybe I am wrong, Mark,
> >> could you give more details about the BRK.
> >>
> >
> > I see, thanks for elaborating. So the suggestion was to go from very
> > expensive single-stepping mode to still expensive breakpoint-based
> > kernel->user->kernel workflow.
> >
> > I think either way it's going to be much slower than avoiding
> > kernel->user->kernel hop, so we should emulate STP instead, yep.
>
> Exactly, in most cases, simluation is the better option.
>
> >
> >>>
> >>>>
> >>>> * XOL management is absurdly expensive.
> >>>>
> >>>>   Does uprobes keep the XOL slot around (like krpobes does), or does it
> >>>>   create the slot afresh for each trap?
> >>>
> >>> XOL *page* is created once per process, lazily, and then we just
> >>> juggle a bunch of fixed slots there for each instance of
> >>> single-stepped uprobe. And yes, there are some bottlenecks in XOL
> >>> management, though it's mostly due to lock contention (as it is
> >>> implemented right now). Liao and Oleg have been improving XOL
> >>> management, but still, avoiding XOL in the first place is the much
> >>> preferred way.
> >>>
> >>>>
> >>>>   If that's trying to create a slot afresh for each trap, there are
> >>>>   several opportunities for improvement, e.g. keep the slot around for
> >>>>   as long as the uprobe exists, or pre-allocate shared slots for common
> >>>>   instructions and use those.
> >>>
> >>> As I mentioned, a XOL page is allocated and mapped once, but yes, it
> >>> seems like we dynamically get a slot in it for each single-stepped
> >>> execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
> >>> probably not a bad idea to just cache and hold a XOL slot for each
> >>> specific uprobe, I don't see why we should limit ourselves to just one
> >>> XOL page. We also don't need to pre-size each slot, we can probably
> >>> allocate just the right amount of space for a given uprobe.
> >>>
> >>> All good ideas for sure, we should do them, IMO. But we'll still be
> >>> paying an extra kernel->user->kernel switch, which almost certainly is
> >>> slower than doing a simple stack push emulation just like we do in
> >>> x86-64 case, no?
> >>>
> >>>
> >>> BTW, I did a quick local profiling run. I don't think XOL management
> >>> is the main source of overhead. I see 5% of CPU cycles spent in
> >>> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
> >>> traces. There are at least 22% CPU cycles spent in some
> >>> local_daif_restore function, though, not sure what that is, but might
> >>> be related to interrupt handling, right?
> >>
> >> The local_daif_restore() is part of the path for all user->kernel->user
> >> context switch, including interrupt handling, breakpoints, and single-stepping
> >> etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't
> >> been enable to reproduce this on my local machine.
> >>
> >> Andrii, could you use the patch below to see if it can reduce the 5% of
> >> CPU cycles spent in arch_uprobe_copy_ixol, I doubt that D/I cache
> >> synchronization is the cause of this part of overhead.
> >>
> >> https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/
> >
> > tbh, I think pre-allocating and setting up fixed XOL slots instead of
> > dynamically allocating them is the way to go. We can allocate as many
> > special "[uprobes]" pages as necessary to accommodate all the
> > single-stepped uprobes in XOL, remember their index, etc. I think
> > that's much more performant (and simpler, IMO) approach overall. All
> > this preparation, however expensive it might be, will be done once per
> > each attached/detached uprobe/uretprobe, which is the place where we
> > want to do expensive stuff. Not when uprobe/uretprobe is actually
> > triggered.
>
> Generally agreed. But I have two concerns about pre-allocating of XOL slots:
>
> 1. If some uprobes/uretprobes are rarely or never triggered, pre-allocating
>    slots for them seem wastful. However, it isn't a issue on machines with
>    ample memory(e.g., hundreds of GB or a couple of TB).

I'm not too worried here because a) hopefully most uprobes won't
need XOL and b) if a user attaches to lots of uprobes and needs lots of
XOL slots, then the machine is assumed to be powerful enough to handle
that. If we run out of memory, then that machine shouldn't be uprobed
that extensively. Generally, I think this is not a big concern.

But keep in mind, if we go this way, I think we need to make XOL slots
as small and exactly sized as necessary to fit just enough code to
handle single-stepping for that particular uprobe. Currently we size
them to 128 bytes to make it easy to dynamically allocate them in the
hot path. With pre-allocating XOL we have different tradeoffs and I'd
allocate them as tightly as possible, because attach/detach can be
significantly slower than uprobe/uretprobe triggering.

>
> 2. Currently, threads that trigger the same uprobe/uretprobe are dynamically
>    allocated different slots. Since we can't predict how many threads will
>    trigger the same uprobe/uretprobe. it can't pre-allocate enough slots for
>    per each uprobe/uretprobe. If you allow all threads to share the slot
>    associated with each uprobe/uretprobe. this would result in XOL pages being
>    shared across procecces, I think it would introduce significant changes to
>    the page management of xol_area and fault handling.

Yes, I was thinking that all threads will share the same XOL slot, why
would you want that per-thread? And yes, it's a pretty significant
change, but I think it's a good one.

>
> >
> >>
> >>>
> >>>
> >>> The take away I'd like to communicate here is avoiding the
> >>> single-stepping need is *the best way* to go, IMO. So if we can
> >>> emulate those STP instructions for uprobe *cheaply*, that would be
> >>> awesome.
> >>
> >> Given some significant uprobe optimizations from Oleg and Andrii
> >> merged, I am curious to see how these changes impact the profiling
> >> result on Arm64. So I re-ran the selftest bench on the latest kernel
> >> (based on tag next-20241104) and the kernel (based on tag next-20240909)
> >> that I used when I submitted this patch. The results re-ran are shown
> >> below.
> >>
> >> next-20240909(xol stp + xol nop)
> >> --------------------------------
> >> uprobe-nop      ( 1 cpus):    0.424 ± 0.000M/s  (  0.424M/s/cpu)
> >> uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
> >> uprobe-ret      ( 1 cpus):    2.101 ± 0.002M/s  (  2.101M/s/cpu)
> >> uretprobe-nop   ( 1 cpus):    0.347 ± 0.000M/s  (  0.347M/s/cpu)
> >> uretprobe-push  ( 1 cpus):    0.349 ± 0.000M/s  (  0.349M/s/cpu)
> >> uretprobe-ret   ( 1 cpus):    1.051 ± 0.001M/s  (  1.051M/s/cpu)
> >>
> >> next-20240909(sim stp + sim nop)
> >> --------------------------------
> >> uprobe-nop      ( 1 cpus):    2.042 ± 0.002M/s  (  2.042M/s/cpu)
> >> uprobe-push     ( 1 cpus):    1.363 ± 0.002M/s  (  1.363M/s/cpu)
> >> uprobe-ret      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
> >> uretprobe-nop   ( 1 cpus):    1.049 ± 0.001M/s  (  1.049M/s/cpu)
> >> uretprobe-push  ( 1 cpus):    0.780 ± 0.000M/s  (  0.780M/s/cpu)
> >> uretprobe-ret   ( 1 cpus):    1.065 ± 0.001M/s  (  1.065M/s/cpu)
> >>
> >> next-20241104 (xol stp + sim nop)
> >> ---------------------------------
> >> uprobe-nop      ( 1 cpus):    2.044 ± 0.003M/s  (  2.044M/s/cpu)
> >> uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
> >> uprobe-ret      ( 1 cpus):    2.047 ± 0.001M/s  (  2.047M/s/cpu)
> >> uretprobe-nop   ( 1 cpus):    0.832 ± 0.003M/s  (  0.832M/s/cpu)
> >> uretprobe-push  ( 1 cpus):    0.328 ± 0.000M/s  (  0.328M/s/cpu)
> >> uretprobe-ret   ( 1 cpus):    0.833 ± 0.003M/s  (  0.833M/s/cpu)
> >>
> >> next-20241104 (sim stp + sim nop)
> >> ---------------------------------
> >> uprobe-nop      ( 1 cpus):    2.052 ± 0.002M/s  (  2.052M/s/cpu)
> >> uprobe-push     ( 1 cpus):    1.411 ± 0.002M/s  (  1.411M/s/cpu)
> >> uprobe-ret      ( 1 cpus):    2.052 ± 0.005M/s  (  2.052M/s/cpu)
> >> uretprobe-nop   ( 1 cpus):    0.839 ± 0.005M/s  (  0.839M/s/cpu)
> >> uretprobe-push  ( 1 cpus):    0.702 ± 0.002M/s  (  0.702M/s/cpu)
> >> uretprobe-ret   ( 1 cpus):    0.837 ± 0.001M/s  (  0.837M/s/cpu)
> >>
> >> It seems that the STP simluation approach in this patch significantly
> >> improves uprobe-push throughtput by 240% (from 0.415Ms/ to 1.411M/s)
> >> and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels
> >> bases on next-20240909 and next-20241104. While there is still room
> >> for improvement to reach the throughput of -nop and -ret, the gains
> >> are very substantail.
> >>
> >> But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using
> >> single-stepping stp, which are far lower compared to the result when
> >> when I submitted patch(look closely to the uprobe-push and uretprobe-push
> >> results in commit log). I'm certain that the tests were run on the
> >> same bare-metal machine with background tasked minimized. I doubt some
> >> uncommitted uprobe optimization on my local repo twist the result of
> >> -push using single-step.
> >
> > You can always profiler and compare before/after, right? See where
> > added costs are coming from?
>
> Yes, I've been looking through the git reflog to restore the
> kernel tree to the state when I submitted this patch.
>
> Regarding the throughtput data of uprobe/uretprobe-push using
> single-stepping, I believe *the re-ran result provide a more
> accurate picture*. If we carefully compare the throughput of
> -nop and -push, we can see that they are very close (uprobe-nop
> is 0.424M/s/cpu and uprobe-push is 0.415M/s/cpu, uretprobe-nop
> is 0.347M/s/cpu and upretprobe-push is 0.349M/s/cpu). This is
> expected, as both use single-stepping to execute NOP and STP.
> There's no reason -push using single-step should outperform the
> one of -nop(uprobe-push is 0.868M/s/cpu a month ago).
>
> In summary, understanding these performance is crucial for selecting
> an approach that balances accuracy and efficiency. Although this
> patch offers lower gain compared to my initial implementation.
> But, considering the complexity of 'STP', the cost seems acceptable,
> right?

But why do the suboptimal thing if we can do better?

>
> >
> >>
> >> In addition to the micro benchmark, I also re-ran Redis benchmark to
> >> compare the impact of single-stepping STP and simluated STP to the
> >> throughput of redis-server. I believe the impact of uprobe on the real
> >> application depends on the frequency of uprobe triggered and the application's
> >> hot paths. Therefore, I wouldn't say the simluated STP will benefit all
> >> real world applications.
> >
> > It will benefit *a lot* of real world applications, though, so I think
> > it's very important to improve.
>
> Good to see that.
>
> >
> >>
> >> $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET
> >> $ redis-server --port 7778 --protected-mode no --save "" --appendonly no & &&
> >>   bpftrace -e 'uprobe:redis-server:readQueryFromClient{}
> >>                uprobe:redis-server:processCommand{}
> >>                uprobe:redis-server:aeApiPoll {}'
> >>
> >> next-20241104
> >> -------------
> >> RPS: 55602.1
> >>
> >> next-20241104 + ss stp
> >> ----------------------
> >> RPS: 47220.9
> >> uprobe@@aeApiPoll: 554565
> >> uprobe@processCommand: 1275160
> >> uprobe@readQueryFromClient: 1277710
> >>
> >> next-20241104 + sim stp
> >> -----------------------
> >> RPS           54290.09
> >> uprobe@aeApiPoll: 496007
> >> uprobe@processCommand: 1275160
> >> uprobe@readQueryFromClient: 1277710
> >>
> >> Andrii expressed concern that the STP simulation in this patch is too
> >> expensive. If we believe the result I re-ran, perhaps it is not a
> >> bad way to simluate STP. Looking forward to your feedbacks, or someone
> >> could propose a cheaper way to simluate STP, I'm very happy to test it
> >> on my machine, thanks.
> >
> > I'm no ARM64 expert, but seeing that we simulate stack pushes with
> > just memory reads/write for x86-64, it feels like that should be
> > satisfactory for ARM64. So I'd suggest you to go back to the initial
> > implementation, clean it up, rebase, re-benchmark, and send a new
> > revision. Let's continue the discussion there?
>
> I've re-benchmarked the initial implementation on the latest kernel
> tree (tag next-20241104). The results for single-stepping, the initial
> patch, and this patch are pasted below for comparison:
>
> next-20241104 + xol stp
> -----------------------
> uprobe-push     ( 1 cpus):    0.415 ± 0.001M/s  (  0.415M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.328 ± 0.000M/s  (  0.328M/s/cpu)
>
> next-20241104 + initial patch
> -----------------------------
> uprobe-push     ( 1 cpus):    1.798 ± 0.001M/s  (  1.798M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.806 ± 0.001M/s  (  0.806M/s/cpu)
>
> next-20241104 + this patch
> --------------------------
> uprobe-push     ( 1 cpus):    1.411 ± 0.002M/s  (  1.411M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.702 ± 0.002M/s  (  0.702M/s/cpu)
>
> As shown in the benchmark results, the initial implementation offers
> only a limited performance advantage, especially considering that it
> comes at the cost of accuracy.
>

What limited accuracy? And since when is almost 30% (1.4M/s -> 1.8M/s)
a "limited performance advantage"?

> >
> >>
> >> [...]
> >>
> >>>>>
> >>>>> xol-stp
> >>>>> -------
> >>>>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> >>>>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> >>>>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> >>>>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> >>>>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> >>>>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
> >>>>>
> >>>>> simulated-stp
> >>>>> -------------
> >>>>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> >>>>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> >>>>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> >>>>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> >>>>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> >>>>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
> >>>>>
> >>>>> The profiling results based on the upstream kernel with spinlock
> >>>>> optimization patches [2] reveals the simulation of STP increase the
> >>>>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
> >>>>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> >>>>>
> >>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> >>>>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> >>>>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
> >>>>>
> >>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>>>> ---
> >>>>>  arch/arm64/include/asm/insn.h            |  1 +
> >>>>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
> >>>>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
> >>>>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >>>>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
> >>>>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
> >>>>>  arch/arm64/lib/insn.c                    |  5 ++
> >>>>>  7 files changed, 134 insertions(+)
> >>>>>
> >>>
> >>> [...]
> >>
> >> --
> >> BR
> >> Liao, Chang
> >>
>
> --
> BR
> Liao, Chang
>
Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Masami Hiramatsu (Google) 1 month ago
On Fri, 25 Oct 2024 13:51:14 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> All good ideas for sure, we should do them, IMO. But we'll still be
> paying an extra kernel->user->kernel switch, which almost certainly is
> slower than doing a simple stack push emulation just like we do in
> x86-64 case, no?
> 
> 
> BTW, I did a quick local profiling run. I don't think XOL management
> is the main source of overhead. I see 5% of CPU cycles spent in
> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
> traces. There are at least 22% CPU cycles spent in some
> local_daif_restore function, though, not sure what that is, but might
> be related to interrupt handling, right?
> 
> 
> The take away I'd like to communicate here is avoiding the
> single-stepping need is *the best way* to go, IMO. So if we can
> emulate those STP instructions for uprobe *cheaply*, that would be
> awesome.

+1.
Unlike kprobes, uprobe single-stepping needs to go to userspace (for
security), which can introduce much bigger overhead. If we can just
emulate the instruction safely in the kernel, it should be done.

Thank you,

-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>
Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Andrii Nakryiko 2 months, 2 weeks ago
On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@huawei.com> wrote:
>
> This patch is the second part of a series to improve the selftest bench
> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> significantly impact uprobe/uretprobe performance at function entry in
> most user cases. Profiling results below reveals the STP that executes
> in the xol slot and trap back to kernel, reduce redis RPS and increase
> the time of string grep obviously.
>
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>
> Redis GET (higher is better)
> ----------------------------
> No uprobe: 49149.71 RPS
> Single-stepped STP: 46750.82 RPS
> Emulated STP: 48981.19 RPS
>
> Redis SET (larger is better)
> ----------------------------
> No uprobe: 49761.14 RPS
> Single-stepped STP: 45255.01 RPS
> Emulated stp: 48619.21 RPS
>
> Grep (lower is better)
> ----------------------
> No uprobe: 2.165s
> Single-stepped STP: 15.314s
> Emualted STP: 2.216s
>
> Additionally, a profiling of the entry instruction for all leaf and
> non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than
> 50%. So simulting the STP on the function entry is a more viable option
> for uprobe.
>
> In the first version [1], it used a uaccess routine to simulate the STP
> that push fp/lr into stack, which use double STTR instructions for
> memory store. But as Mark pointed out, this approach can't simulate the
> correct single-atomicity and ordering properties of STP, especiallly
> when it interacts with MTE, POE, etc. So this patch uses a more complex

Do all those effects matter if the thread is stopped after the
breakpoint? This is pushing to the stack, right? Other threads are not
supposed to access that memory anyways (not the well-defined ones, at
least, I suppose). Do we really need all these complications for
uprobes? We use a similar approach in x86-64, see emulate_push_stack()
in arch/x86/kernel/uprobes.c and it works great in practice (and has
been for years by now). Would be nice to keep things simple knowing
that this is specifically for this rather well-defined and restricted
uprobe/uretprobe use case.
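
For reference, the pattern being described amounts to an ordinary
uaccess write plus a stack-pointer adjustment. A minimal sketch of that
pattern (illustrative only; the name and error handling here are
assumptions, not the actual arch/x86/kernel/uprobes.c code):

#include <linux/ptrace.h>
#include <linux/uaccess.h>

/*
 * Sketch of an x86-64-style stack-push emulation: write the value to
 * the user stack with a plain uaccess copy, then move the saved stack
 * pointer. Simplified illustration, not the real emulate_push_stack().
 */
static int push_user_stack(struct pt_regs *regs, unsigned long val)
{
	unsigned long new_sp = regs->sp - sizeof(val);

	if (copy_to_user((void __user *)new_sp, &val, sizeof(val)))
		return -EFAULT;

	regs->sp = new_sp;
	return 0;
}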

Sorry, I can't help reviewing this, but I have a hunch that we might
be over-killing it with this approach, no?


> and inefficient approach that acquires user stack pages, maps them to
> kernel address space, and allows kernel to use STP directly push fp/lr
> into the stack pages.
>
> xol-stp
> -------
> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>
> simulated-stp
> -------------
> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>
> The profiling results based on the upstream kernel with spinlock
> optimization patches [2] reveals the simulation of STP increase the
> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  arch/arm64/include/asm/insn.h            |  1 +
>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>  arch/arm64/lib/insn.c                    |  5 ++
>  7 files changed, 134 insertions(+)
>

[...]
Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Liao, Chang 2 months, 2 weeks ago

On 2024/9/11 4:54, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@huawei.com> wrote:
>>
>> This patch is the second part of a series to improve the selftest bench
>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
>> significantly impact uprobe/uretprobe performance at function entry in
>> most user cases. Profiling results below reveals the STP that executes
>> in the xol slot and trap back to kernel, reduce redis RPS and increase
>> the time of string grep obviously.
>>
>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>
>> Redis GET (higher is better)
>> ----------------------------
>> No uprobe: 49149.71 RPS
>> Single-stepped STP: 46750.82 RPS
>> Emulated STP: 48981.19 RPS
>>
>> Redis SET (larger is better)
>> ----------------------------
>> No uprobe: 49761.14 RPS
>> Single-stepped STP: 45255.01 RPS
>> Emulated stp: 48619.21 RPS
>>
>> Grep (lower is better)
>> ----------------------
>> No uprobe: 2.165s
>> Single-stepped STP: 15.314s
>> Emualted STP: 2.216s
>>
>> Additionally, a profiling of the entry instruction for all leaf and
>> non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than
>> 50%. So simulting the STP on the function entry is a more viable option
>> for uprobe.
>>
>> In the first version [1], it used a uaccess routine to simulate the STP
>> that push fp/lr into stack, which use double STTR instructions for
>> memory store. But as Mark pointed out, this approach can't simulate the
>> correct single-atomicity and ordering properties of STP, especiallly
>> when it interacts with MTE, POE, etc. So this patch uses a more complex
> 
> Do all those effects matter if the thread is stopped after the
> breakpoint? This is pushing to the stack, right? Other threads are not
> supposed to access that memory anyways (not the well-defined ones, at
> least, I suppose). Do we really need all these complications for

I raised the same question in my reply to Mark. Since the STP
simulation focuses on the uprobe/uretprobe at function entry, which
pushes two registers onto the *stack*, I believe it might not require
strict alignment with the exact properties of STP. However, as you
know, Mark stands by his comments about STP simulation, which is why I
sent this patch out. Although the gain is not as good as the uaccess
version, it still offers a better result than the current XOL code.

> uprobes? We use a similar approach in x86-64, see emulate_push_stack()
> in arch/x86/kernel/uprobes.c and it works great in practice (and has

Yes, I've noticed the x86 routine. Actually, the CPU-specific
difference lies in Arm64 CPUs with PAN enabled. For security reasons,
the kernel can't use STP (storing a pair of registers to memory) when
accessing a userspace address. This forces the kernel to use STTR
instructions (storing a single register to unprivileged memory) twice,
which can't match the atomicity and ordering properties of the
original STP in userspace. In the future, if Arm64 adds an instruction
for storing a pair of registers to unprivileged memory, it ought to
replace this inefficient approach.
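
To make the constraint concrete, the uaccess-style approach boils down
to two independent unprivileged stores plus the writeback, roughly as
in the sketch below (an illustration, not the code from [1]; the helper
name and the fallback handling are assumptions). With PAN enabled each
put_user() becomes an STTR, so the two stores are neither single-copy
atomic nor ordered like a native STP, which is exactly the objection
above:

#include <linux/types.h>
#include <linux/uaccess.h>
#include <asm/insn.h>
#include <asm/ptrace.h>

/*
 * Rough sketch of a uaccess-based emulation of
 * 'stp x29, x30, [sp, #imm]!' at a uprobe hit: two separate
 * unprivileged stores followed by the pre-index writeback of sp.
 * Illustrative only; it does not reproduce STP's single-copy
 * atomicity or ordering.
 */
static bool emulate_stp_fp_lr(struct pt_regs *regs, long imm)
{
	u64 __user *slot = (u64 __user *)(regs->sp + imm); /* imm < 0 for a push */

	if (put_user(regs->regs[29], slot) ||		/* fp */
	    put_user(regs->regs[30], slot + 1))		/* lr */
		return false;	/* caller falls back to XOL single-step */

	regs->sp += imm;			/* pre-index writeback */
	regs->pc += AARCH64_INSN_SIZE;		/* step over the STP */
	return true;
}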

> been for years by now). Would be nice to keep things simple knowing
> that this is specifically for this rather well-defined and restricted
> uprobe/uretprobe use case.
> 
> Sorry, I can't help reviewing this, but I have a hunch that we might
> be over-killing it with this approach, no?

This approach indeed fails to obtain the maximum benefit from
simulation.

> 
> 
>> and inefficient approach that acquires user stack pages, maps them to
>> kernel address space, and allows kernel to use STP directly push fp/lr
>> into the stack pages.
>>
>> xol-stp
>> -------
>> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
>> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
>> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
>>
>> simulated-stp
>> -------------
>> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
>> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
>> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
>> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
>> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
>> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
>>
>> The profiling results based on the upstream kernel with spinlock
>> optimization patches [2] reveals the simulation of STP increase the
>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>>
>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
>>
>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>> ---
>>  arch/arm64/include/asm/insn.h            |  1 +
>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>>  arch/arm64/lib/insn.c                    |  5 ++
>>  7 files changed, 134 insertions(+)
>>
> 
> [...]
> 
> 

-- 
BR
Liao, Chang
Re: [PATCH] arm64: uprobes: Simulate STP for pushing fp/lr into user stack
Posted by Andrii Nakryiko 2 months, 2 weeks ago
On Tue, Sep 10, 2024 at 8:07 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> On 2024/9/11 4:54, Andrii Nakryiko wrote:
> > On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@huawei.com> wrote:
> >>
> >> This patch is the second part of a series to improve the selftest bench
> >> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> >> significantly impact uprobe/uretprobe performance at function entry in
> >> most user cases. Profiling results below reveals the STP that executes
> >> in the xol slot and trap back to kernel, reduce redis RPS and increase
> >> the time of string grep obviously.
> >>
> >> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>
> >> Redis GET (higher is better)
> >> ----------------------------
> >> No uprobe: 49149.71 RPS
> >> Single-stepped STP: 46750.82 RPS
> >> Emulated STP: 48981.19 RPS
> >>
> >> Redis SET (larger is better)
> >> ----------------------------
> >> No uprobe: 49761.14 RPS
> >> Single-stepped STP: 45255.01 RPS
> >> Emulated stp: 48619.21 RPS
> >>
> >> Grep (lower is better)
> >> ----------------------
> >> No uprobe: 2.165s
> >> Single-stepped STP: 15.314s
> >> Emualted STP: 2.216s
> >>
> >> Additionally, a profiling of the entry instruction for all leaf and
> >> non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than
> >> 50%. So simulting the STP on the function entry is a more viable option
> >> for uprobe.
> >>
> >> In the first version [1], it used a uaccess routine to simulate the STP
> >> that push fp/lr into stack, which use double STTR instructions for
> >> memory store. But as Mark pointed out, this approach can't simulate the
> >> correct single-atomicity and ordering properties of STP, especiallly
> >> when it interacts with MTE, POE, etc. So this patch uses a more complex
> >
> > Do all those effects matter if the thread is stopped after the
> > breakpoint? This is pushing to the stack, right? Other threads are not
> > supposed to access that memory anyways (not the well-defined ones, at
> > least, I suppose). Do we really need all these complications for
>
> I raised the same question in my reply to Mark. Since the STP
> simulation focuses on the uprobe/uretprobe at function entry, which
> pushes two registers onto the *stack*, I believe it might not
> require strict alignment with the exact properties of STP. However,
> as you know, Mark

Agreed.

> stands by his comments about STP simulation, which is why I sent this
> patch out. Although the gain is not as good as the uaccess version,
> it still offers a better result than the current XOL code.
>
> > uprobes? We use a similar approach in x86-64, see emulate_push_stack()
> > in arch/x86/kernel/uprobes.c and it works great in practice (and has
>
> Yes, I've noticed the x86 routine. Actually, the CPU-specific
> difference lies in Arm64 CPUs with PAN enabled. For security reasons,
> the kernel can't use STP (storing a pair of registers to memory) when
> accessing a userspace address. This forces the kernel to use STTR
> instructions (storing a single register to unprivileged memory)
> twice, which can't match the atomicity and ordering properties of the
> original STP in userspace. In the future, if Arm64 adds an
> instruction for storing a pair of registers to unprivileged memory,
> it ought to replace this inefficient approach.
>
> > been for years by now). Would be nice to keep things simple knowing
> > that this is specifically for this rather well-defined and restricted
> > uprobe/uretprobe use case.
> >
> > Sorry, I can't help reviewing this, but I have a hunch that we might
> > be over-killing it with this approach, no?
>
> This approach indeed fails to obtain the maximum benefit from
> simulation.
>

Yes, the performance hit is very large for seemingly no good reason,
which is why I'm asking.

And all this performance concern is not just pure microbenchmarking.
We do have use cases with millions of uprobe calls per second. E.g.,
tracing every single Python function call, then rolling a die (in a
BPF program), and sampling some portion of them (more heavy-weight
logic). As such, it's critical to be able to trigger uprobes as fast
as possible, because most of the time we do nothing. So any overhead
like this one is very noticeable and limits possible applications.
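
To make that pattern concrete, a minimal sketch of such a sampling
uprobe program is below; SAMPLE_RATE is a made-up illustration and the
attach target is left to the loader, neither comes from this thread:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define SAMPLE_RATE 100		/* keep roughly 1 hit in 100 */

/* Attach target (binary:function) is supplied by the loader. */
SEC("uprobe")
int sample_entry(struct pt_regs *ctx)
{
	/* Roll a pseudo-random number; bail out on most hits so only
	 * the uprobe trigger cost is paid.
	 */
	if (bpf_get_prandom_u32() % SAMPLE_RATE)
		return 0;

	/* Sampled hit: heavier-weight collection would go here. */
	bpf_printk("sampled call");
	return 0;
}

char LICENSE[] SEC("license") = "GPL";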

> >
> >
> >> and inefficient approach that acquires user stack pages, maps them to
> >> kernel address space, and allows kernel to use STP directly push fp/lr
> >> into the stack pages.
> >>
> >> xol-stp
> >> -------
> >> uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
> >> uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
> >> uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
> >> uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
> >> uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
> >> uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)
> >>
> >> simulated-stp
> >> -------------
> >> uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
> >> uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
> >> uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
> >> uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
> >> uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
> >> uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)
> >>
> >> The profiling results based on the upstream kernel with spinlock
> >> optimization patches [2] reveals the simulation of STP increase the
> >> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and
> >> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> >>
> >> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> >> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> >> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
> >>
> >> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >> ---
> >>  arch/arm64/include/asm/insn.h            |  1 +
> >>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
> >>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
> >>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
> >>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
> >>  arch/arm64/lib/insn.c                    |  5 ++
> >>  7 files changed, 134 insertions(+)
> >>
> >
> > [...]
> >
> >
>
> --
> BR
> Liao, Chang