uprobes: Optimize the allocation of insn_slot for performance

[PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao Chang 1 year, 6 months ago

The profiling result of single-thread model of selftests bench reveals
performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
ARM64. On my local testing machine, 5% of CPU time is consumed by
find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.

This patch introduce struct uprobe_breakpoint to track previously
allocated insn_slot for frequently hit uprobe. it effectively reduce the
need for redundant insn_slot writes and subsequent expensive cache
flush, especially on architecture like ARM64. This patch has been tested
on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
bench and Redis GET/SET benchmark result below reveal obivious
performance gain.

before-opt
----------
trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
Redis SET (RPS) uprobe: 42728.52
Redis GET (RPS) uprobe: 43640.18
Redis SET (RPS) uretprobe: 40624.54
Redis GET (RPS) uretprobe: 41180.56

after-opt
---------
trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
Redis SET (RPS) uprobe: 43939.69
Redis GET (RPS) uprobe: 45200.80
Redis SET (RPS) uretprobe: 41658.58
Redis GET (RPS) uretprobe: 42805.80

While some uprobes might still need to share the same insn_slot, this
patch compare the instructions in the resued insn_slot with the
instructions execute out-of-line firstly to decides allocate a new one
or not.

Additionally, this patch use a rbtree associated with each thread that
hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
rbtree of uprobe_breakpoints has smaller node, better locality and less
contention, it result in faster lookup times compared to find_uprobe().

The other part of this patch are some necessary memory management for
uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
hit uprobe that doesn't already have a corresponding node in rbtree. All
uprobe_breakpoints will be freed when thread exit.

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 include/linux/uprobes.h |   3 +
 kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
 2 files changed, 211 insertions(+), 38 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..04ee465980af 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -78,6 +78,9 @@ struct uprobe_task {
 
 	struct return_instance		*return_instances;
 	unsigned int			depth;
+
+	struct rb_root			breakpoints_tree;
+	rwlock_t			breakpoints_treelock;
 };
 
 struct return_instance {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..3f1a6dc2a327 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -33,6 +33,7 @@
 #define MAX_UPROBE_XOL_SLOTS		UINSNS_PER_PAGE
 
 static struct rb_root uprobes_tree = RB_ROOT;
+
 /*
  * allows us to skip the uprobe_mmap if there are no uprobe events active
  * at this time.  Probably a fine grained per inode count is better?
@@ -886,6 +887,174 @@ static bool filter_chain(struct uprobe *uprobe,
 	return ret;
 }
 
+static struct uprobe_task *get_utask(void);
+static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr);
+static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp);
+
+struct uprobe_breakpoint {
+	struct rb_node		rb_node;
+	refcount_t		ref;
+	unsigned long		slot;
+	unsigned long		vaddr;
+	struct uprobe		*uprobe;
+};
+
+static void put_ubp(struct uprobe_breakpoint *ubp)
+{
+	if (refcount_dec_and_test(&ubp->ref)) {
+		put_uprobe(ubp->uprobe);
+		kfree(ubp);
+	}
+}
+
+static struct uprobe_breakpoint *get_ubp(struct uprobe_breakpoint *ubp)
+{
+	refcount_inc(&ubp->ref);
+	return ubp;
+}
+
+#define __node_2_ubp(node) rb_entry((node), struct uprobe_breakpoint, rb_node)
+
+struct __ubp_key {
+	unsigned long bp_vaddr;
+};
+
+static int ubp_cmp(const unsigned long bp_vaddr,
+		   const struct uprobe_breakpoint *ubp)
+{
+	if (bp_vaddr < ubp->vaddr)
+		return -1;
+
+	if (bp_vaddr > ubp->vaddr)
+		return 1;
+
+	return 0;
+}
+
+static int __ubp_cmp_key(const void *k, const struct rb_node *b)
+{
+	const struct __ubp_key *key = k;
+
+	return ubp_cmp(key->bp_vaddr, __node_2_ubp(b));
+}
+
+static int __ubp_cmp(struct rb_node *a, const struct rb_node *b)
+{
+	const struct uprobe_breakpoint *ubp = __node_2_ubp(a);
+
+	return ubp_cmp(ubp->vaddr, __node_2_ubp(b));
+}
+
+static void delete_breakpoint(struct uprobe_task *utask)
+{
+	struct rb_node *node;
+
+	write_lock(&utask->breakpoints_treelock);
+	node = rb_first(&utask->breakpoints_tree);
+	while (node) {
+		rb_erase(node, &utask->breakpoints_tree);
+		write_unlock(&utask->breakpoints_treelock);
+
+		put_ubp(__node_2_ubp(node));
+
+		write_lock(&utask->breakpoints_treelock);
+		node = rb_next(node);
+	}
+	write_unlock(&utask->breakpoints_treelock);
+}
+
+static struct uprobe_breakpoint *alloc_breakpoint(struct uprobe_task *utask,
+						  struct uprobe *uprobe,
+						  unsigned long bp_vaddr)
+{
+	struct uprobe_breakpoint *ubp;
+	struct rb_node *node;
+
+	ubp = kzalloc(sizeof(struct uprobe_breakpoint), GFP_KERNEL);
+	if (!ubp)
+		return NULL;
+
+	ubp->vaddr = bp_vaddr;
+	ubp->uprobe = uprobe;
+	/* get access + creation ref */
+	refcount_set(&ubp->ref, 2);
+	ubp->slot = UINSNS_PER_PAGE;
+
+	write_lock(&utask->breakpoints_treelock);
+	node = rb_find_add(&ubp->rb_node, &utask->breakpoints_tree, __ubp_cmp);
+	write_unlock(&utask->breakpoints_treelock);
+
+	/* Two or more threads hit the same breakpoint */
+	if (node) {
+		WARN_ON(uprobe != __node_2_ubp(node)->upobre);
+		kfree(ubp);
+		return get_ubp(__node_2_ubp(node));
+	}
+
+	return ubp;
+}
+
+static struct uprobe_breakpoint *find_breakpoint(struct uprobe_task *utask,
+						 unsigned long bp_vaddr)
+{
+	struct rb_node *node;
+	struct __ubp_key key = {
+		.bp_vaddr = bp_vaddr,
+	};
+
+	read_lock(&utask->breakpoints_treelock);
+	node = rb_find(&key, &utask->breakpoints_tree, __ubp_cmp_key);
+	read_unlock(&utask->breakpoints_treelock);
+
+	if (node)
+		return get_ubp(__node_2_ubp(node));
+
+	return NULL;
+}
+
+static struct uprobe_breakpoint *find_active_breakpoint(struct pt_regs *regs,
+							unsigned long bp_vaddr)
+{
+	struct uprobe_task *utask = get_utask();
+	struct uprobe_breakpoint *ubp;
+	struct uprobe *uprobe;
+	int is_swbp;
+
+	if (unlikely(!utask))
+		return NULL;
+
+	ubp = find_breakpoint(utask, bp_vaddr);
+	if (ubp)
+		return ubp;
+
+	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
+	if (!uprobe) {
+		if (is_swbp > 0) {
+			/* No matching uprobe; signal SIGTRAP. */
+			force_sig(SIGTRAP);
+		} else {
+			/*
+			 * Either we raced with uprobe_unregister() or we can't
+			 * access this memory. The latter is only possible if
+			 * another thread plays with our ->mm. In both cases
+			 * we can simply restart. If this vma was unmapped we
+			 * can pretend this insn was not executed yet and get
+			 * the (correct) SIGSEGV after restart.
+			 */
+			instruction_pointer_set(regs, bp_vaddr);
+		}
+		return NULL;
+	}
+
+	ubp = alloc_breakpoint(utask, uprobe, bp_vaddr);
+	if (!ubp) {
+		put_uprobe(uprobe);
+		return NULL;
+	}
+
+	return ubp;
+}
+
 static int
 install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long vaddr)
@@ -1576,9 +1745,8 @@ void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
 /*
  *  - search for a free slot.
  */
-static unsigned long xol_take_insn_slot(struct xol_area *area)
+static __always_inline int xol_take_insn_slot(struct xol_area *area)
 {
-	unsigned long slot_addr;
 	int slot_nr;
 
 	do {
@@ -1590,34 +1758,48 @@ static unsigned long xol_take_insn_slot(struct xol_area *area)
 			slot_nr = UINSNS_PER_PAGE;
 			continue;
 		}
-		wait_event(area->wq, (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
+		wait_event(area->wq,
+			   (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
 	} while (slot_nr >= UINSNS_PER_PAGE);
 
-	slot_addr = area->vaddr + (slot_nr * UPROBE_XOL_SLOT_BYTES);
-	atomic_inc(&area->slot_count);
+	return slot_nr;
+}
+
+static __always_inline unsigned long
+choose_insn_slot(struct xol_area *area, struct uprobe_breakpoint *ubp)
+{
+	if ((ubp->slot == UINSNS_PER_PAGE) ||
+	    test_and_set_bit(ubp->slot, area->bitmap)) {
+		ubp->slot = xol_take_insn_slot(area);
+	}
 
-	return slot_addr;
+	atomic_inc(&area->slot_count);
+	return area->vaddr + ubp->slot * UPROBE_XOL_SLOT_BYTES;
 }
 
 /*
  * xol_get_insn_slot - allocate a slot for xol.
  * Returns the allocated slot address or 0.
  */
-static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
+static unsigned long xol_get_insn_slot(struct uprobe_breakpoint *ubp)
 {
 	struct xol_area *area;
 	unsigned long xol_vaddr;
+	struct uprobe *uprobe = ubp->uprobe;
 
 	area = get_xol_area();
 	if (!area)
 		return 0;
 
-	xol_vaddr = xol_take_insn_slot(area);
+	xol_vaddr = choose_insn_slot(area, ubp);
 	if (unlikely(!xol_vaddr))
 		return 0;
 
-	arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
-			      &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
+	if (memcmp((void *)xol_vaddr, &uprobe->arch.ixol,
+		   sizeof(uprobe->arch.ixol)))
+		arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
+				      &uprobe->arch.ixol,
+				      sizeof(uprobe->arch.ixol));
 
 	return xol_vaddr;
 }
@@ -1717,8 +1899,7 @@ void uprobe_free_utask(struct task_struct *t)
 	if (!utask)
 		return;
 
-	if (utask->active_uprobe)
-		put_uprobe(utask->active_uprobe);
+	delete_breakpoint(utask);
 
 	ri = utask->return_instances;
 	while (ri)
@@ -1739,8 +1920,11 @@ void uprobe_free_utask(struct task_struct *t)
  */
 static struct uprobe_task *get_utask(void)
 {
-	if (!current->utask)
+	if (!current->utask) {
 		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+		current->utask->breakpoints_tree = RB_ROOT;
+		rwlock_init(&current->utask->breakpoints_treelock);
+	}
 	return current->utask;
 }
 
@@ -1921,7 +2105,8 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 
 /* Prepare to single-step probed instruction out of line. */
 static int
-pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
+pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+	  struct uprobe_breakpoint *ubp)
 {
 	struct uprobe_task *utask;
 	unsigned long xol_vaddr;
@@ -1931,12 +2116,12 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 	if (!utask)
 		return -ENOMEM;
 
-	xol_vaddr = xol_get_insn_slot(uprobe);
+	xol_vaddr = xol_get_insn_slot(ubp);
 	if (!xol_vaddr)
 		return -ENOMEM;
 
 	utask->xol_vaddr = xol_vaddr;
-	utask->vaddr = bp_vaddr;
+	utask->vaddr = ubp->vaddr;
 
 	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
 	if (unlikely(err)) {
@@ -2182,32 +2367,19 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
  */
 static void handle_swbp(struct pt_regs *regs)
 {
+	struct uprobe_breakpoint *ubp;
 	struct uprobe *uprobe;
 	unsigned long bp_vaddr;
-	int is_swbp;
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 	if (bp_vaddr == get_trampoline_vaddr())
 		return handle_trampoline(regs);
 
-	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
-	if (!uprobe) {
-		if (is_swbp > 0) {
-			/* No matching uprobe; signal SIGTRAP. */
-			force_sig(SIGTRAP);
-		} else {
-			/*
-			 * Either we raced with uprobe_unregister() or we can't
-			 * access this memory. The latter is only possible if
-			 * another thread plays with our ->mm. In both cases
-			 * we can simply restart. If this vma was unmapped we
-			 * can pretend this insn was not executed yet and get
-			 * the (correct) SIGSEGV after restart.
-			 */
-			instruction_pointer_set(regs, bp_vaddr);
-		}
+	ubp = find_active_breakpoint(regs, bp_vaddr);
+	if (!ubp)
 		return;
-	}
+
+	uprobe = ubp->uprobe;
 
 	/* change it in advance for ->handler() and restart */
 	instruction_pointer_set(regs, bp_vaddr);
@@ -2241,12 +2413,11 @@ static void handle_swbp(struct pt_regs *regs)
 	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
 		goto out;
 
-	if (!pre_ssout(uprobe, regs, bp_vaddr))
-		return;
+	pre_ssout(uprobe, regs, ubp);
 
 	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
 out:
-	put_uprobe(uprobe);
+	put_ubp(ubp);
 }
 
 /*
@@ -2266,7 +2437,6 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	else
 		WARN_ON_ONCE(1);
 
-	put_uprobe(uprobe);
 	utask->active_uprobe = NULL;
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(current);
-- 
2.34.1

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 6 months ago

Hi Andrii and Oleg.

This patch sent by me two weeks ago also aim to optimize the performance of uprobe
on arm64. I notice recent discussions on the performance and scalability of uprobes
within the mailing list. Considering this interest, I've added you and other relevant
maintainers to the CC list for broader visibility and potential collaboration.

Thanks.

在 2024/7/27 17:44, Liao Chang 写道:
> The profiling result of single-thread model of selftests bench reveals
> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> ARM64. On my local testing machine, 5% of CPU time is consumed by
> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> 
> This patch introduce struct uprobe_breakpoint to track previously
> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> need for redundant insn_slot writes and subsequent expensive cache
> flush, especially on architecture like ARM64. This patch has been tested
> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> bench and Redis GET/SET benchmark result below reveal obivious
> performance gain.
> 
> before-opt
> ----------
> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> Redis SET (RPS) uprobe: 42728.52
> Redis GET (RPS) uprobe: 43640.18
> Redis SET (RPS) uretprobe: 40624.54
> Redis GET (RPS) uretprobe: 41180.56
> 
> after-opt
> ---------
> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> Redis SET (RPS) uprobe: 43939.69
> Redis GET (RPS) uprobe: 45200.80
> Redis SET (RPS) uretprobe: 41658.58
> Redis GET (RPS) uretprobe: 42805.80
> 
> While some uprobes might still need to share the same insn_slot, this
> patch compare the instructions in the resued insn_slot with the
> instructions execute out-of-line firstly to decides allocate a new one
> or not.
> 
> Additionally, this patch use a rbtree associated with each thread that
> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> rbtree of uprobe_breakpoints has smaller node, better locality and less
> contention, it result in faster lookup times compared to find_uprobe().
> 
> The other part of this patch are some necessary memory management for
> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> hit uprobe that doesn't already have a corresponding node in rbtree. All
> uprobe_breakpoints will be freed when thread exit.
> 
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  include/linux/uprobes.h |   3 +
>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>  2 files changed, 211 insertions(+), 38 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index f46e0ca0169c..04ee465980af 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -78,6 +78,9 @@ struct uprobe_task {
>  
>  	struct return_instance		*return_instances;
>  	unsigned int			depth;
> +
> +	struct rb_root			breakpoints_tree;
> +	rwlock_t			breakpoints_treelock;
>  };
>  
>  struct return_instance {
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2c83ba776fc7..3f1a6dc2a327 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -33,6 +33,7 @@
>  #define MAX_UPROBE_XOL_SLOTS		UINSNS_PER_PAGE
>  
>  static struct rb_root uprobes_tree = RB_ROOT;
> +
>  /*
>   * allows us to skip the uprobe_mmap if there are no uprobe events active
>   * at this time.  Probably a fine grained per inode count is better?
> @@ -886,6 +887,174 @@ static bool filter_chain(struct uprobe *uprobe,
>  	return ret;
>  }
>  
> +static struct uprobe_task *get_utask(void);
> +static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr);
> +static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp);
> +
> +struct uprobe_breakpoint {
> +	struct rb_node		rb_node;
> +	refcount_t		ref;
> +	unsigned long		slot;
> +	unsigned long		vaddr;
> +	struct uprobe		*uprobe;
> +};
> +
> +static void put_ubp(struct uprobe_breakpoint *ubp)
> +{
> +	if (refcount_dec_and_test(&ubp->ref)) {
> +		put_uprobe(ubp->uprobe);
> +		kfree(ubp);
> +	}
> +}
> +
> +static struct uprobe_breakpoint *get_ubp(struct uprobe_breakpoint *ubp)
> +{
> +	refcount_inc(&ubp->ref);
> +	return ubp;
> +}
> +
> +#define __node_2_ubp(node) rb_entry((node), struct uprobe_breakpoint, rb_node)
> +
> +struct __ubp_key {
> +	unsigned long bp_vaddr;
> +};
> +
> +static int ubp_cmp(const unsigned long bp_vaddr,
> +		   const struct uprobe_breakpoint *ubp)
> +{
> +	if (bp_vaddr < ubp->vaddr)
> +		return -1;
> +
> +	if (bp_vaddr > ubp->vaddr)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static int __ubp_cmp_key(const void *k, const struct rb_node *b)
> +{
> +	const struct __ubp_key *key = k;
> +
> +	return ubp_cmp(key->bp_vaddr, __node_2_ubp(b));
> +}
> +
> +static int __ubp_cmp(struct rb_node *a, const struct rb_node *b)
> +{
> +	const struct uprobe_breakpoint *ubp = __node_2_ubp(a);
> +
> +	return ubp_cmp(ubp->vaddr, __node_2_ubp(b));
> +}
> +
> +static void delete_breakpoint(struct uprobe_task *utask)
> +{
> +	struct rb_node *node;
> +
> +	write_lock(&utask->breakpoints_treelock);
> +	node = rb_first(&utask->breakpoints_tree);
> +	while (node) {
> +		rb_erase(node, &utask->breakpoints_tree);
> +		write_unlock(&utask->breakpoints_treelock);
> +
> +		put_ubp(__node_2_ubp(node));
> +
> +		write_lock(&utask->breakpoints_treelock);
> +		node = rb_next(node);
> +	}
> +	write_unlock(&utask->breakpoints_treelock);
> +}
> +
> +static struct uprobe_breakpoint *alloc_breakpoint(struct uprobe_task *utask,
> +						  struct uprobe *uprobe,
> +						  unsigned long bp_vaddr)
> +{
> +	struct uprobe_breakpoint *ubp;
> +	struct rb_node *node;
> +
> +	ubp = kzalloc(sizeof(struct uprobe_breakpoint), GFP_KERNEL);
> +	if (!ubp)
> +		return NULL;
> +
> +	ubp->vaddr = bp_vaddr;
> +	ubp->uprobe = uprobe;
> +	/* get access + creation ref */
> +	refcount_set(&ubp->ref, 2);
> +	ubp->slot = UINSNS_PER_PAGE;
> +
> +	write_lock(&utask->breakpoints_treelock);
> +	node = rb_find_add(&ubp->rb_node, &utask->breakpoints_tree, __ubp_cmp);
> +	write_unlock(&utask->breakpoints_treelock);
> +
> +	/* Two or more threads hit the same breakpoint */
> +	if (node) {
> +		WARN_ON(uprobe != __node_2_ubp(node)->upobre);

A stupid typo.

s/upobre/uprobe/g

> +		kfree(ubp);
> +		return get_ubp(__node_2_ubp(node));
> +	}
> +
> +	return ubp;
> +}
> +
> +static struct uprobe_breakpoint *find_breakpoint(struct uprobe_task *utask,
> +						 unsigned long bp_vaddr)
> +{
> +	struct rb_node *node;
> +	struct __ubp_key key = {
> +		.bp_vaddr = bp_vaddr,
> +	};
> +
> +	read_lock(&utask->breakpoints_treelock);
> +	node = rb_find(&key, &utask->breakpoints_tree, __ubp_cmp_key);
> +	read_unlock(&utask->breakpoints_treelock);
> +
> +	if (node)
> +		return get_ubp(__node_2_ubp(node));
> +
> +	return NULL;
> +}
> +
> +static struct uprobe_breakpoint *find_active_breakpoint(struct pt_regs *regs,
> +							unsigned long bp_vaddr)
> +{
> +	struct uprobe_task *utask = get_utask();
> +	struct uprobe_breakpoint *ubp;
> +	struct uprobe *uprobe;
> +	int is_swbp;
> +
> +	if (unlikely(!utask))
> +		return NULL;
> +
> +	ubp = find_breakpoint(utask, bp_vaddr);
> +	if (ubp)
> +		return ubp;
> +
> +	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> +	if (!uprobe) {
> +		if (is_swbp > 0) {
> +			/* No matching uprobe; signal SIGTRAP. */
> +			force_sig(SIGTRAP);
> +		} else {
> +			/*
> +			 * Either we raced with uprobe_unregister() or we can't
> +			 * access this memory. The latter is only possible if
> +			 * another thread plays with our ->mm. In both cases
> +			 * we can simply restart. If this vma was unmapped we
> +			 * can pretend this insn was not executed yet and get
> +			 * the (correct) SIGSEGV after restart.
> +			 */
> +			instruction_pointer_set(regs, bp_vaddr);
> +		}
> +		return NULL;
> +	}
> +
> +	ubp = alloc_breakpoint(utask, uprobe, bp_vaddr);
> +	if (!ubp) {
> +		put_uprobe(uprobe);
> +		return NULL;
> +	}
> +
> +	return ubp;
> +}
> +
>  static int
>  install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
>  			struct vm_area_struct *vma, unsigned long vaddr)
> @@ -1576,9 +1745,8 @@ void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
>  /*
>   *  - search for a free slot.
>   */
> -static unsigned long xol_take_insn_slot(struct xol_area *area)
> +static __always_inline int xol_take_insn_slot(struct xol_area *area)
>  {
> -	unsigned long slot_addr;
>  	int slot_nr;
>  
>  	do {
> @@ -1590,34 +1758,48 @@ static unsigned long xol_take_insn_slot(struct xol_area *area)
>  			slot_nr = UINSNS_PER_PAGE;
>  			continue;
>  		}
> -		wait_event(area->wq, (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
> +		wait_event(area->wq,
> +			   (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
>  	} while (slot_nr >= UINSNS_PER_PAGE);
>  
> -	slot_addr = area->vaddr + (slot_nr * UPROBE_XOL_SLOT_BYTES);
> -	atomic_inc(&area->slot_count);
> +	return slot_nr;
> +}
> +
> +static __always_inline unsigned long
> +choose_insn_slot(struct xol_area *area, struct uprobe_breakpoint *ubp)
> +{
> +	if ((ubp->slot == UINSNS_PER_PAGE) ||
> +	    test_and_set_bit(ubp->slot, area->bitmap)) {
> +		ubp->slot = xol_take_insn_slot(area);
> +	}
>  
> -	return slot_addr;
> +	atomic_inc(&area->slot_count);
> +	return area->vaddr + ubp->slot * UPROBE_XOL_SLOT_BYTES;
>  }
>  
>  /*
>   * xol_get_insn_slot - allocate a slot for xol.
>   * Returns the allocated slot address or 0.
>   */
> -static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
> +static unsigned long xol_get_insn_slot(struct uprobe_breakpoint *ubp)
>  {
>  	struct xol_area *area;
>  	unsigned long xol_vaddr;
> +	struct uprobe *uprobe = ubp->uprobe;
>  
>  	area = get_xol_area();
>  	if (!area)
>  		return 0;
>  
> -	xol_vaddr = xol_take_insn_slot(area);
> +	xol_vaddr = choose_insn_slot(area, ubp);
>  	if (unlikely(!xol_vaddr))
>  		return 0;
>  
> -	arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
> -			      &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
> +	if (memcmp((void *)xol_vaddr, &uprobe->arch.ixol,
> +		   sizeof(uprobe->arch.ixol)))
> +		arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
> +				      &uprobe->arch.ixol,
> +				      sizeof(uprobe->arch.ixol));

Perhaps, should i move memcmp() to the arch_uprobe_copy_ixol() provided by the architecture
code?

>  
>  	return xol_vaddr;
>  }
> @@ -1717,8 +1899,7 @@ void uprobe_free_utask(struct task_struct *t)
>  	if (!utask)
>  		return;
>  
> -	if (utask->active_uprobe)
> -		put_uprobe(utask->active_uprobe);
> +	delete_breakpoint(utask);
>  
>  	ri = utask->return_instances;
>  	while (ri)
> @@ -1739,8 +1920,11 @@ void uprobe_free_utask(struct task_struct *t)
>   */
>  static struct uprobe_task *get_utask(void)
>  {
> -	if (!current->utask)
> +	if (!current->utask) {
>  		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
> +		current->utask->breakpoints_tree = RB_ROOT;
> +		rwlock_init(&current->utask->breakpoints_treelock);
> +	}
>  	return current->utask;
>  }
>  
> @@ -1921,7 +2105,8 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
>  
>  /* Prepare to single-step probed instruction out of line. */
>  static int
> -pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
> +pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
> +	  struct uprobe_breakpoint *ubp)
>  {
>  	struct uprobe_task *utask;
>  	unsigned long xol_vaddr;
> @@ -1931,12 +2116,12 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
>  	if (!utask)
>  		return -ENOMEM;
>  
> -	xol_vaddr = xol_get_insn_slot(uprobe);
> +	xol_vaddr = xol_get_insn_slot(ubp);
>  	if (!xol_vaddr)
>  		return -ENOMEM;
>  
>  	utask->xol_vaddr = xol_vaddr;
> -	utask->vaddr = bp_vaddr;
> +	utask->vaddr = ubp->vaddr;
>  
>  	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
>  	if (unlikely(err)) {
> @@ -2182,32 +2367,19 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
>   */
>  static void handle_swbp(struct pt_regs *regs)
>  {
> +	struct uprobe_breakpoint *ubp;
>  	struct uprobe *uprobe;
>  	unsigned long bp_vaddr;
> -	int is_swbp;
>  
>  	bp_vaddr = uprobe_get_swbp_addr(regs);
>  	if (bp_vaddr == get_trampoline_vaddr())
>  		return handle_trampoline(regs);
>  
> -	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> -	if (!uprobe) {
> -		if (is_swbp > 0) {
> -			/* No matching uprobe; signal SIGTRAP. */
> -			force_sig(SIGTRAP);
> -		} else {
> -			/*
> -			 * Either we raced with uprobe_unregister() or we can't
> -			 * access this memory. The latter is only possible if
> -			 * another thread plays with our ->mm. In both cases
> -			 * we can simply restart. If this vma was unmapped we
> -			 * can pretend this insn was not executed yet and get
> -			 * the (correct) SIGSEGV after restart.
> -			 */
> -			instruction_pointer_set(regs, bp_vaddr);
> -		}
> +	ubp = find_active_breakpoint(regs, bp_vaddr);
> +	if (!ubp)
>  		return;
> -	}
> +
> +	uprobe = ubp->uprobe;
>  
>  	/* change it in advance for ->handler() and restart */
>  	instruction_pointer_set(regs, bp_vaddr);
> @@ -2241,12 +2413,11 @@ static void handle_swbp(struct pt_regs *regs)
>  	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>  		goto out;
>  
> -	if (!pre_ssout(uprobe, regs, bp_vaddr))
> -		return;
> +	pre_ssout(uprobe, regs, ubp);
>  
>  	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
>  out:
> -	put_uprobe(uprobe);
> +	put_ubp(ubp);
>  }
>  
>  /*
> @@ -2266,7 +2437,6 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
>  	else
>  		WARN_ON_ONCE(1);
>  
> -	put_uprobe(uprobe);
>  	utask->active_uprobe = NULL;
>  	utask->state = UTASK_RUNNING;
>  	xol_free_insn_slot(current);

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Oleg Nesterov 1 year, 6 months ago

Hi Liao,

To be honest I didn't even try to look at your patch, sorry.

Because I think it would be better to delay it in an case. Until Andrii
finishes/pushes his optimization changes which (in particular) include
find_active_uprobe_rcu/etc.

Then you can rebease and re-benchmark your patch on top of these changes.

Oleg.


On 08/08, Liao, Chang wrote:
>
> Hi Andrii and Oleg.
>
> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> on arm64. I notice recent discussions on the performance and scalability of uprobes
> within the mailing list. Considering this interest, I've added you and other relevant
> maintainers to the CC list for broader visibility and potential collaboration.
>
> Thanks.
>
> 在 2024/7/27 17:44, Liao Chang 写道:
> > The profiling result of single-thread model of selftests bench reveals
> > performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> > ARM64. On my local testing machine, 5% of CPU time is consumed by
> > find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> > about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >
> > This patch introduce struct uprobe_breakpoint to track previously
> > allocated insn_slot for frequently hit uprobe. it effectively reduce the
> > need for redundant insn_slot writes and subsequent expensive cache
> > flush, especially on architecture like ARM64. This patch has been tested
> > on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> > bench and Redis GET/SET benchmark result below reveal obivious
> > performance gain.
> >
> > before-opt
> > ----------
> > trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> > trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> > trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> > trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> > trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> > trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> > Redis SET (RPS) uprobe: 42728.52
> > Redis GET (RPS) uprobe: 43640.18
> > Redis SET (RPS) uretprobe: 40624.54
> > Redis GET (RPS) uretprobe: 41180.56
> >
> > after-opt
> > ---------
> > trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> > trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> > trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> > trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> > trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> > trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> > Redis SET (RPS) uprobe: 43939.69
> > Redis GET (RPS) uprobe: 45200.80
> > Redis SET (RPS) uretprobe: 41658.58
> > Redis GET (RPS) uretprobe: 42805.80
> >
> > While some uprobes might still need to share the same insn_slot, this
> > patch compare the instructions in the resued insn_slot with the
> > instructions execute out-of-line firstly to decides allocate a new one
> > or not.
> >
> > Additionally, this patch use a rbtree associated with each thread that
> > hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> > rbtree of uprobe_breakpoints has smaller node, better locality and less
> > contention, it result in faster lookup times compared to find_uprobe().
> >
> > The other part of this patch are some necessary memory management for
> > uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> > hit uprobe that doesn't already have a corresponding node in rbtree. All
> > uprobe_breakpoints will be freed when thread exit.
> >
> > Signed-off-by: Liao Chang <liaochang1@huawei.com>
> > ---
> >  include/linux/uprobes.h |   3 +
> >  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> >  2 files changed, 211 insertions(+), 38 deletions(-)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index f46e0ca0169c..04ee465980af 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -78,6 +78,9 @@ struct uprobe_task {
> >
> >  	struct return_instance		*return_instances;
> >  	unsigned int			depth;
> > +
> > +	struct rb_root			breakpoints_tree;
> > +	rwlock_t			breakpoints_treelock;
> >  };
> >
> >  struct return_instance {
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 2c83ba776fc7..3f1a6dc2a327 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -33,6 +33,7 @@
> >  #define MAX_UPROBE_XOL_SLOTS		UINSNS_PER_PAGE
> >
> >  static struct rb_root uprobes_tree = RB_ROOT;
> > +
> >  /*
> >   * allows us to skip the uprobe_mmap if there are no uprobe events active
> >   * at this time.  Probably a fine grained per inode count is better?
> > @@ -886,6 +887,174 @@ static bool filter_chain(struct uprobe *uprobe,
> >  	return ret;
> >  }
> >
> > +static struct uprobe_task *get_utask(void);
> > +static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr);
> > +static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp);
> > +
> > +struct uprobe_breakpoint {
> > +	struct rb_node		rb_node;
> > +	refcount_t		ref;
> > +	unsigned long		slot;
> > +	unsigned long		vaddr;
> > +	struct uprobe		*uprobe;
> > +};
> > +
> > +static void put_ubp(struct uprobe_breakpoint *ubp)
> > +{
> > +	if (refcount_dec_and_test(&ubp->ref)) {
> > +		put_uprobe(ubp->uprobe);
> > +		kfree(ubp);
> > +	}
> > +}
> > +
> > +static struct uprobe_breakpoint *get_ubp(struct uprobe_breakpoint *ubp)
> > +{
> > +	refcount_inc(&ubp->ref);
> > +	return ubp;
> > +}
> > +
> > +#define __node_2_ubp(node) rb_entry((node), struct uprobe_breakpoint, rb_node)
> > +
> > +struct __ubp_key {
> > +	unsigned long bp_vaddr;
> > +};
> > +
> > +static int ubp_cmp(const unsigned long bp_vaddr,
> > +		   const struct uprobe_breakpoint *ubp)
> > +{
> > +	if (bp_vaddr < ubp->vaddr)
> > +		return -1;
> > +
> > +	if (bp_vaddr > ubp->vaddr)
> > +		return 1;
> > +
> > +	return 0;
> > +}
> > +
> > +static int __ubp_cmp_key(const void *k, const struct rb_node *b)
> > +{
> > +	const struct __ubp_key *key = k;
> > +
> > +	return ubp_cmp(key->bp_vaddr, __node_2_ubp(b));
> > +}
> > +
> > +static int __ubp_cmp(struct rb_node *a, const struct rb_node *b)
> > +{
> > +	const struct uprobe_breakpoint *ubp = __node_2_ubp(a);
> > +
> > +	return ubp_cmp(ubp->vaddr, __node_2_ubp(b));
> > +}
> > +
> > +static void delete_breakpoint(struct uprobe_task *utask)
> > +{
> > +	struct rb_node *node;
> > +
> > +	write_lock(&utask->breakpoints_treelock);
> > +	node = rb_first(&utask->breakpoints_tree);
> > +	while (node) {
> > +		rb_erase(node, &utask->breakpoints_tree);
> > +		write_unlock(&utask->breakpoints_treelock);
> > +
> > +		put_ubp(__node_2_ubp(node));
> > +
> > +		write_lock(&utask->breakpoints_treelock);
> > +		node = rb_next(node);
> > +	}
> > +	write_unlock(&utask->breakpoints_treelock);
> > +}
> > +
> > +static struct uprobe_breakpoint *alloc_breakpoint(struct uprobe_task *utask,
> > +						  struct uprobe *uprobe,
> > +						  unsigned long bp_vaddr)
> > +{
> > +	struct uprobe_breakpoint *ubp;
> > +	struct rb_node *node;
> > +
> > +	ubp = kzalloc(sizeof(struct uprobe_breakpoint), GFP_KERNEL);
> > +	if (!ubp)
> > +		return NULL;
> > +
> > +	ubp->vaddr = bp_vaddr;
> > +	ubp->uprobe = uprobe;
> > +	/* get access + creation ref */
> > +	refcount_set(&ubp->ref, 2);
> > +	ubp->slot = UINSNS_PER_PAGE;
> > +
> > +	write_lock(&utask->breakpoints_treelock);
> > +	node = rb_find_add(&ubp->rb_node, &utask->breakpoints_tree, __ubp_cmp);
> > +	write_unlock(&utask->breakpoints_treelock);
> > +
> > +	/* Two or more threads hit the same breakpoint */
> > +	if (node) {
> > +		WARN_ON(uprobe != __node_2_ubp(node)->upobre);
>
> A stupid typo.
>
> s/upobre/uprobe/g
>
> > +		kfree(ubp);
> > +		return get_ubp(__node_2_ubp(node));
> > +	}
> > +
> > +	return ubp;
> > +}
> > +
> > +static struct uprobe_breakpoint *find_breakpoint(struct uprobe_task *utask,
> > +						 unsigned long bp_vaddr)
> > +{
> > +	struct rb_node *node;
> > +	struct __ubp_key key = {
> > +		.bp_vaddr = bp_vaddr,
> > +	};
> > +
> > +	read_lock(&utask->breakpoints_treelock);
> > +	node = rb_find(&key, &utask->breakpoints_tree, __ubp_cmp_key);
> > +	read_unlock(&utask->breakpoints_treelock);
> > +
> > +	if (node)
> > +		return get_ubp(__node_2_ubp(node));
> > +
> > +	return NULL;
> > +}
> > +
> > +static struct uprobe_breakpoint *find_active_breakpoint(struct pt_regs *regs,
> > +							unsigned long bp_vaddr)
> > +{
> > +	struct uprobe_task *utask = get_utask();
> > +	struct uprobe_breakpoint *ubp;
> > +	struct uprobe *uprobe;
> > +	int is_swbp;
> > +
> > +	if (unlikely(!utask))
> > +		return NULL;
> > +
> > +	ubp = find_breakpoint(utask, bp_vaddr);
> > +	if (ubp)
> > +		return ubp;
> > +
> > +	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> > +	if (!uprobe) {
> > +		if (is_swbp > 0) {
> > +			/* No matching uprobe; signal SIGTRAP. */
> > +			force_sig(SIGTRAP);
> > +		} else {
> > +			/*
> > +			 * Either we raced with uprobe_unregister() or we can't
> > +			 * access this memory. The latter is only possible if
> > +			 * another thread plays with our ->mm. In both cases
> > +			 * we can simply restart. If this vma was unmapped we
> > +			 * can pretend this insn was not executed yet and get
> > +			 * the (correct) SIGSEGV after restart.
> > +			 */
> > +			instruction_pointer_set(regs, bp_vaddr);
> > +		}
> > +		return NULL;
> > +	}
> > +
> > +	ubp = alloc_breakpoint(utask, uprobe, bp_vaddr);
> > +	if (!ubp) {
> > +		put_uprobe(uprobe);
> > +		return NULL;
> > +	}
> > +
> > +	return ubp;
> > +}
> > +
> >  static int
> >  install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
> >  			struct vm_area_struct *vma, unsigned long vaddr)
> > @@ -1576,9 +1745,8 @@ void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
> >  /*
> >   *  - search for a free slot.
> >   */
> > -static unsigned long xol_take_insn_slot(struct xol_area *area)
> > +static __always_inline int xol_take_insn_slot(struct xol_area *area)
> >  {
> > -	unsigned long slot_addr;
> >  	int slot_nr;
> >
> >  	do {
> > @@ -1590,34 +1758,48 @@ static unsigned long xol_take_insn_slot(struct xol_area *area)
> >  			slot_nr = UINSNS_PER_PAGE;
> >  			continue;
> >  		}
> > -		wait_event(area->wq, (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
> > +		wait_event(area->wq,
> > +			   (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
> >  	} while (slot_nr >= UINSNS_PER_PAGE);
> >
> > -	slot_addr = area->vaddr + (slot_nr * UPROBE_XOL_SLOT_BYTES);
> > -	atomic_inc(&area->slot_count);
> > +	return slot_nr;
> > +}
> > +
> > +static __always_inline unsigned long
> > +choose_insn_slot(struct xol_area *area, struct uprobe_breakpoint *ubp)
> > +{
> > +	if ((ubp->slot == UINSNS_PER_PAGE) ||
> > +	    test_and_set_bit(ubp->slot, area->bitmap)) {
> > +		ubp->slot = xol_take_insn_slot(area);
> > +	}
> >
> > -	return slot_addr;
> > +	atomic_inc(&area->slot_count);
> > +	return area->vaddr + ubp->slot * UPROBE_XOL_SLOT_BYTES;
> >  }
> >
> >  /*
> >   * xol_get_insn_slot - allocate a slot for xol.
> >   * Returns the allocated slot address or 0.
> >   */
> > -static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
> > +static unsigned long xol_get_insn_slot(struct uprobe_breakpoint *ubp)
> >  {
> >  	struct xol_area *area;
> >  	unsigned long xol_vaddr;
> > +	struct uprobe *uprobe = ubp->uprobe;
> >
> >  	area = get_xol_area();
> >  	if (!area)
> >  		return 0;
> >
> > -	xol_vaddr = xol_take_insn_slot(area);
> > +	xol_vaddr = choose_insn_slot(area, ubp);
> >  	if (unlikely(!xol_vaddr))
> >  		return 0;
> >
> > -	arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
> > -			      &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
> > +	if (memcmp((void *)xol_vaddr, &uprobe->arch.ixol,
> > +		   sizeof(uprobe->arch.ixol)))
> > +		arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
> > +				      &uprobe->arch.ixol,
> > +				      sizeof(uprobe->arch.ixol));
>
> Perhaps, should i move memcmp() to the arch_uprobe_copy_ixol() provided by the architecture
> code?
>
> >
> >  	return xol_vaddr;
> >  }
> > @@ -1717,8 +1899,7 @@ void uprobe_free_utask(struct task_struct *t)
> >  	if (!utask)
> >  		return;
> >
> > -	if (utask->active_uprobe)
> > -		put_uprobe(utask->active_uprobe);
> > +	delete_breakpoint(utask);
> >
> >  	ri = utask->return_instances;
> >  	while (ri)
> > @@ -1739,8 +1920,11 @@ void uprobe_free_utask(struct task_struct *t)
> >   */
> >  static struct uprobe_task *get_utask(void)
> >  {
> > -	if (!current->utask)
> > +	if (!current->utask) {
> >  		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
> > +		current->utask->breakpoints_tree = RB_ROOT;
> > +		rwlock_init(&current->utask->breakpoints_treelock);
> > +	}
> >  	return current->utask;
> >  }
> >
> > @@ -1921,7 +2105,8 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> >
> >  /* Prepare to single-step probed instruction out of line. */
> >  static int
> > -pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
> > +pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
> > +	  struct uprobe_breakpoint *ubp)
> >  {
> >  	struct uprobe_task *utask;
> >  	unsigned long xol_vaddr;
> > @@ -1931,12 +2116,12 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
> >  	if (!utask)
> >  		return -ENOMEM;
> >
> > -	xol_vaddr = xol_get_insn_slot(uprobe);
> > +	xol_vaddr = xol_get_insn_slot(ubp);
> >  	if (!xol_vaddr)
> >  		return -ENOMEM;
> >
> >  	utask->xol_vaddr = xol_vaddr;
> > -	utask->vaddr = bp_vaddr;
> > +	utask->vaddr = ubp->vaddr;
> >
> >  	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
> >  	if (unlikely(err)) {
> > @@ -2182,32 +2367,19 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
> >   */
> >  static void handle_swbp(struct pt_regs *regs)
> >  {
> > +	struct uprobe_breakpoint *ubp;
> >  	struct uprobe *uprobe;
> >  	unsigned long bp_vaddr;
> > -	int is_swbp;
> >
> >  	bp_vaddr = uprobe_get_swbp_addr(regs);
> >  	if (bp_vaddr == get_trampoline_vaddr())
> >  		return handle_trampoline(regs);
> >
> > -	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
> > -	if (!uprobe) {
> > -		if (is_swbp > 0) {
> > -			/* No matching uprobe; signal SIGTRAP. */
> > -			force_sig(SIGTRAP);
> > -		} else {
> > -			/*
> > -			 * Either we raced with uprobe_unregister() or we can't
> > -			 * access this memory. The latter is only possible if
> > -			 * another thread plays with our ->mm. In both cases
> > -			 * we can simply restart. If this vma was unmapped we
> > -			 * can pretend this insn was not executed yet and get
> > -			 * the (correct) SIGSEGV after restart.
> > -			 */
> > -			instruction_pointer_set(regs, bp_vaddr);
> > -		}
> > +	ubp = find_active_breakpoint(regs, bp_vaddr);
> > +	if (!ubp)
> >  		return;
> > -	}
> > +
> > +	uprobe = ubp->uprobe;
> >
> >  	/* change it in advance for ->handler() and restart */
> >  	instruction_pointer_set(regs, bp_vaddr);
> > @@ -2241,12 +2413,11 @@ static void handle_swbp(struct pt_regs *regs)
> >  	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> >  		goto out;
> >
> > -	if (!pre_ssout(uprobe, regs, bp_vaddr))
> > -		return;
> > +	pre_ssout(uprobe, regs, ubp);
> >
> >  	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
> >  out:
> > -	put_uprobe(uprobe);
> > +	put_ubp(ubp);
> >  }
> >
> >  /*
> > @@ -2266,7 +2437,6 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
> >  	else
> >  		WARN_ON_ONCE(1);
> >
> > -	put_uprobe(uprobe);
> >  	utask->active_uprobe = NULL;
> >  	utask->state = UTASK_RUNNING;
> >  	xol_free_insn_slot(current);
>
> --
> BR
> Liao, Chang
>

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 6 months ago


在 2024/8/9 2:49, Oleg Nesterov 写道:
> Hi Liao,
> 
> To be honest I didn't even try to look at your patch, sorry.
> 
> Because I think it would be better to delay it in an case. Until Andrii
> finishes/pushes his optimization changes which (in particular) include
> find_active_uprobe_rcu/etc.
> 
> Then you can rebease and re-benchmark your patch on top of these changes.

Oleg, Thanks for your guidance. I'll keep an eye on the development of Andrii's
changes and re-evaluate my patch. To ensure minimal disruption, I'll hold off
on further pushing the patch until Andrii's changes are settle down.

> 
> Oleg.
> 
> 
> On 08/08, Liao, Chang wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>> within the mailing list. Considering this interest, I've added you and other relevant
>> maintainers to the CC list for broader visibility and potential collaboration.
>>
>> Thanks.
>>
>> 在 2024/7/27 17:44, Liao Chang 写道:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> ----------
>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> ---------
>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot, this
>>> patch compare the instructions in the resued insn_slot with the
>>> instructions execute out-of-line firstly to decides allocate a new one
>>> or not.
>>>
>>> Additionally, this patch use a rbtree associated with each thread that
>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>> contention, it result in faster lookup times compared to find_uprobe().
>>>
>>> The other part of this patch are some necessary memory management for
>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>> uprobe_breakpoints will be freed when thread exit.
>>>
>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>> ---
>>>  include/linux/uprobes.h |   3 +
>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>
>>> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
>>> index f46e0ca0169c..04ee465980af 100644
>>> --- a/include/linux/uprobes.h
>>> +++ b/include/linux/uprobes.h
>>> @@ -78,6 +78,9 @@ struct uprobe_task {
>>>
>>>  	struct return_instance		*return_instances;
>>>  	unsigned int			depth;
>>> +
>>> +	struct rb_root			breakpoints_tree;
>>> +	rwlock_t			breakpoints_treelock;
>>>  };
>>>
>>>  struct return_instance {
>>> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
>>> index 2c83ba776fc7..3f1a6dc2a327 100644
>>> --- a/kernel/events/uprobes.c
>>> +++ b/kernel/events/uprobes.c
>>> @@ -33,6 +33,7 @@
>>>  #define MAX_UPROBE_XOL_SLOTS		UINSNS_PER_PAGE
>>>
>>>  static struct rb_root uprobes_tree = RB_ROOT;
>>> +
>>>  /*
>>>   * allows us to skip the uprobe_mmap if there are no uprobe events active
>>>   * at this time.  Probably a fine grained per inode count is better?
>>> @@ -886,6 +887,174 @@ static bool filter_chain(struct uprobe *uprobe,
>>>  	return ret;
>>>  }
>>>
>>> +static struct uprobe_task *get_utask(void);
>>> +static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr);
>>> +static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp);
>>> +
>>> +struct uprobe_breakpoint {
>>> +	struct rb_node		rb_node;
>>> +	refcount_t		ref;
>>> +	unsigned long		slot;
>>> +	unsigned long		vaddr;
>>> +	struct uprobe		*uprobe;
>>> +};
>>> +
>>> +static void put_ubp(struct uprobe_breakpoint *ubp)
>>> +{
>>> +	if (refcount_dec_and_test(&ubp->ref)) {
>>> +		put_uprobe(ubp->uprobe);
>>> +		kfree(ubp);
>>> +	}
>>> +}
>>> +
>>> +static struct uprobe_breakpoint *get_ubp(struct uprobe_breakpoint *ubp)
>>> +{
>>> +	refcount_inc(&ubp->ref);
>>> +	return ubp;
>>> +}
>>> +
>>> +#define __node_2_ubp(node) rb_entry((node), struct uprobe_breakpoint, rb_node)
>>> +
>>> +struct __ubp_key {
>>> +	unsigned long bp_vaddr;
>>> +};
>>> +
>>> +static int ubp_cmp(const unsigned long bp_vaddr,
>>> +		   const struct uprobe_breakpoint *ubp)
>>> +{
>>> +	if (bp_vaddr < ubp->vaddr)
>>> +		return -1;
>>> +
>>> +	if (bp_vaddr > ubp->vaddr)
>>> +		return 1;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int __ubp_cmp_key(const void *k, const struct rb_node *b)
>>> +{
>>> +	const struct __ubp_key *key = k;
>>> +
>>> +	return ubp_cmp(key->bp_vaddr, __node_2_ubp(b));
>>> +}
>>> +
>>> +static int __ubp_cmp(struct rb_node *a, const struct rb_node *b)
>>> +{
>>> +	const struct uprobe_breakpoint *ubp = __node_2_ubp(a);
>>> +
>>> +	return ubp_cmp(ubp->vaddr, __node_2_ubp(b));
>>> +}
>>> +
>>> +static void delete_breakpoint(struct uprobe_task *utask)
>>> +{
>>> +	struct rb_node *node;
>>> +
>>> +	write_lock(&utask->breakpoints_treelock);
>>> +	node = rb_first(&utask->breakpoints_tree);
>>> +	while (node) {
>>> +		rb_erase(node, &utask->breakpoints_tree);
>>> +		write_unlock(&utask->breakpoints_treelock);
>>> +
>>> +		put_ubp(__node_2_ubp(node));
>>> +
>>> +		write_lock(&utask->breakpoints_treelock);
>>> +		node = rb_next(node);
>>> +	}
>>> +	write_unlock(&utask->breakpoints_treelock);
>>> +}
>>> +
>>> +static struct uprobe_breakpoint *alloc_breakpoint(struct uprobe_task *utask,
>>> +						  struct uprobe *uprobe,
>>> +						  unsigned long bp_vaddr)
>>> +{
>>> +	struct uprobe_breakpoint *ubp;
>>> +	struct rb_node *node;
>>> +
>>> +	ubp = kzalloc(sizeof(struct uprobe_breakpoint), GFP_KERNEL);
>>> +	if (!ubp)
>>> +		return NULL;
>>> +
>>> +	ubp->vaddr = bp_vaddr;
>>> +	ubp->uprobe = uprobe;
>>> +	/* get access + creation ref */
>>> +	refcount_set(&ubp->ref, 2);
>>> +	ubp->slot = UINSNS_PER_PAGE;
>>> +
>>> +	write_lock(&utask->breakpoints_treelock);
>>> +	node = rb_find_add(&ubp->rb_node, &utask->breakpoints_tree, __ubp_cmp);
>>> +	write_unlock(&utask->breakpoints_treelock);
>>> +
>>> +	/* Two or more threads hit the same breakpoint */
>>> +	if (node) {
>>> +		WARN_ON(uprobe != __node_2_ubp(node)->upobre);
>>
>> A stupid typo.
>>
>> s/upobre/uprobe/g
>>
>>> +		kfree(ubp);
>>> +		return get_ubp(__node_2_ubp(node));
>>> +	}
>>> +
>>> +	return ubp;
>>> +}
>>> +
>>> +static struct uprobe_breakpoint *find_breakpoint(struct uprobe_task *utask,
>>> +						 unsigned long bp_vaddr)
>>> +{
>>> +	struct rb_node *node;
>>> +	struct __ubp_key key = {
>>> +		.bp_vaddr = bp_vaddr,
>>> +	};
>>> +
>>> +	read_lock(&utask->breakpoints_treelock);
>>> +	node = rb_find(&key, &utask->breakpoints_tree, __ubp_cmp_key);
>>> +	read_unlock(&utask->breakpoints_treelock);
>>> +
>>> +	if (node)
>>> +		return get_ubp(__node_2_ubp(node));
>>> +
>>> +	return NULL;
>>> +}
>>> +
>>> +static struct uprobe_breakpoint *find_active_breakpoint(struct pt_regs *regs,
>>> +							unsigned long bp_vaddr)
>>> +{
>>> +	struct uprobe_task *utask = get_utask();
>>> +	struct uprobe_breakpoint *ubp;
>>> +	struct uprobe *uprobe;
>>> +	int is_swbp;
>>> +
>>> +	if (unlikely(!utask))
>>> +		return NULL;
>>> +
>>> +	ubp = find_breakpoint(utask, bp_vaddr);
>>> +	if (ubp)
>>> +		return ubp;
>>> +
>>> +	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
>>> +	if (!uprobe) {
>>> +		if (is_swbp > 0) {
>>> +			/* No matching uprobe; signal SIGTRAP. */
>>> +			force_sig(SIGTRAP);
>>> +		} else {
>>> +			/*
>>> +			 * Either we raced with uprobe_unregister() or we can't
>>> +			 * access this memory. The latter is only possible if
>>> +			 * another thread plays with our ->mm. In both cases
>>> +			 * we can simply restart. If this vma was unmapped we
>>> +			 * can pretend this insn was not executed yet and get
>>> +			 * the (correct) SIGSEGV after restart.
>>> +			 */
>>> +			instruction_pointer_set(regs, bp_vaddr);
>>> +		}
>>> +		return NULL;
>>> +	}
>>> +
>>> +	ubp = alloc_breakpoint(utask, uprobe, bp_vaddr);
>>> +	if (!ubp) {
>>> +		put_uprobe(uprobe);
>>> +		return NULL;
>>> +	}
>>> +
>>> +	return ubp;
>>> +}
>>> +
>>>  static int
>>>  install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
>>>  			struct vm_area_struct *vma, unsigned long vaddr)
>>> @@ -1576,9 +1745,8 @@ void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
>>>  /*
>>>   *  - search for a free slot.
>>>   */
>>> -static unsigned long xol_take_insn_slot(struct xol_area *area)
>>> +static __always_inline int xol_take_insn_slot(struct xol_area *area)
>>>  {
>>> -	unsigned long slot_addr;
>>>  	int slot_nr;
>>>
>>>  	do {
>>> @@ -1590,34 +1758,48 @@ static unsigned long xol_take_insn_slot(struct xol_area *area)
>>>  			slot_nr = UINSNS_PER_PAGE;
>>>  			continue;
>>>  		}
>>> -		wait_event(area->wq, (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
>>> +		wait_event(area->wq,
>>> +			   (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
>>>  	} while (slot_nr >= UINSNS_PER_PAGE);
>>>
>>> -	slot_addr = area->vaddr + (slot_nr * UPROBE_XOL_SLOT_BYTES);
>>> -	atomic_inc(&area->slot_count);
>>> +	return slot_nr;
>>> +}
>>> +
>>> +static __always_inline unsigned long
>>> +choose_insn_slot(struct xol_area *area, struct uprobe_breakpoint *ubp)
>>> +{
>>> +	if ((ubp->slot == UINSNS_PER_PAGE) ||
>>> +	    test_and_set_bit(ubp->slot, area->bitmap)) {
>>> +		ubp->slot = xol_take_insn_slot(area);
>>> +	}
>>>
>>> -	return slot_addr;
>>> +	atomic_inc(&area->slot_count);
>>> +	return area->vaddr + ubp->slot * UPROBE_XOL_SLOT_BYTES;
>>>  }
>>>
>>>  /*
>>>   * xol_get_insn_slot - allocate a slot for xol.
>>>   * Returns the allocated slot address or 0.
>>>   */
>>> -static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
>>> +static unsigned long xol_get_insn_slot(struct uprobe_breakpoint *ubp)
>>>  {
>>>  	struct xol_area *area;
>>>  	unsigned long xol_vaddr;
>>> +	struct uprobe *uprobe = ubp->uprobe;
>>>
>>>  	area = get_xol_area();
>>>  	if (!area)
>>>  		return 0;
>>>
>>> -	xol_vaddr = xol_take_insn_slot(area);
>>> +	xol_vaddr = choose_insn_slot(area, ubp);
>>>  	if (unlikely(!xol_vaddr))
>>>  		return 0;
>>>
>>> -	arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
>>> -			      &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
>>> +	if (memcmp((void *)xol_vaddr, &uprobe->arch.ixol,
>>> +		   sizeof(uprobe->arch.ixol)))
>>> +		arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
>>> +				      &uprobe->arch.ixol,
>>> +				      sizeof(uprobe->arch.ixol));
>>
>> Perhaps, should i move memcmp() to the arch_uprobe_copy_ixol() provided by the architecture
>> code?
>>
>>>
>>>  	return xol_vaddr;
>>>  }
>>> @@ -1717,8 +1899,7 @@ void uprobe_free_utask(struct task_struct *t)
>>>  	if (!utask)
>>>  		return;
>>>
>>> -	if (utask->active_uprobe)
>>> -		put_uprobe(utask->active_uprobe);
>>> +	delete_breakpoint(utask);
>>>
>>>  	ri = utask->return_instances;
>>>  	while (ri)
>>> @@ -1739,8 +1920,11 @@ void uprobe_free_utask(struct task_struct *t)
>>>   */
>>>  static struct uprobe_task *get_utask(void)
>>>  {
>>> -	if (!current->utask)
>>> +	if (!current->utask) {
>>>  		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
>>> +		current->utask->breakpoints_tree = RB_ROOT;
>>> +		rwlock_init(&current->utask->breakpoints_treelock);
>>> +	}
>>>  	return current->utask;
>>>  }
>>>
>>> @@ -1921,7 +2105,8 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
>>>
>>>  /* Prepare to single-step probed instruction out of line. */
>>>  static int
>>> -pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
>>> +pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
>>> +	  struct uprobe_breakpoint *ubp)
>>>  {
>>>  	struct uprobe_task *utask;
>>>  	unsigned long xol_vaddr;
>>> @@ -1931,12 +2116,12 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
>>>  	if (!utask)
>>>  		return -ENOMEM;
>>>
>>> -	xol_vaddr = xol_get_insn_slot(uprobe);
>>> +	xol_vaddr = xol_get_insn_slot(ubp);
>>>  	if (!xol_vaddr)
>>>  		return -ENOMEM;
>>>
>>>  	utask->xol_vaddr = xol_vaddr;
>>> -	utask->vaddr = bp_vaddr;
>>> +	utask->vaddr = ubp->vaddr;
>>>
>>>  	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
>>>  	if (unlikely(err)) {
>>> @@ -2182,32 +2367,19 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
>>>   */
>>>  static void handle_swbp(struct pt_regs *regs)
>>>  {
>>> +	struct uprobe_breakpoint *ubp;
>>>  	struct uprobe *uprobe;
>>>  	unsigned long bp_vaddr;
>>> -	int is_swbp;
>>>
>>>  	bp_vaddr = uprobe_get_swbp_addr(regs);
>>>  	if (bp_vaddr == get_trampoline_vaddr())
>>>  		return handle_trampoline(regs);
>>>
>>> -	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
>>> -	if (!uprobe) {
>>> -		if (is_swbp > 0) {
>>> -			/* No matching uprobe; signal SIGTRAP. */
>>> -			force_sig(SIGTRAP);
>>> -		} else {
>>> -			/*
>>> -			 * Either we raced with uprobe_unregister() or we can't
>>> -			 * access this memory. The latter is only possible if
>>> -			 * another thread plays with our ->mm. In both cases
>>> -			 * we can simply restart. If this vma was unmapped we
>>> -			 * can pretend this insn was not executed yet and get
>>> -			 * the (correct) SIGSEGV after restart.
>>> -			 */
>>> -			instruction_pointer_set(regs, bp_vaddr);
>>> -		}
>>> +	ubp = find_active_breakpoint(regs, bp_vaddr);
>>> +	if (!ubp)
>>>  		return;
>>> -	}
>>> +
>>> +	uprobe = ubp->uprobe;
>>>
>>>  	/* change it in advance for ->handler() and restart */
>>>  	instruction_pointer_set(regs, bp_vaddr);
>>> @@ -2241,12 +2413,11 @@ static void handle_swbp(struct pt_regs *regs)
>>>  	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>>  		goto out;
>>>
>>> -	if (!pre_ssout(uprobe, regs, bp_vaddr))
>>> -		return;
>>> +	pre_ssout(uprobe, regs, ubp);
>>>
>>>  	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
>>>  out:
>>> -	put_uprobe(uprobe);
>>> +	put_ubp(ubp);
>>>  }
>>>
>>>  /*
>>> @@ -2266,7 +2437,6 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
>>>  	else
>>>  		WARN_ON_ONCE(1);
>>>
>>> -	put_uprobe(uprobe);
>>>  	utask->active_uprobe = NULL;
>>>  	utask->state = UTASK_RUNNING;
>>>  	xol_free_insn_slot(current);
>>
>> --
>> BR
>> Liao, Chang
>>
> 
> 

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 6 months ago

On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
> Hi Andrii and Oleg.
>
> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> on arm64. I notice recent discussions on the performance and scalability of uprobes
> within the mailing list. Considering this interest, I've added you and other relevant
> maintainers to the CC list for broader visibility and potential collaboration.
>

Hi Liao,

As you can see there is an active work to improve uprobes, that
changes lifetime management of uprobes, removes a bunch of locks taken
in the uprobe/uretprobe hot path, etc. It would be nice if you can
hold off a bit with your changes until all that lands. And then
re-benchmark, as costs might shift.

But also see some remarks below.

> Thanks.
>
> 在 2024/7/27 17:44, Liao Chang 写道:
> > The profiling result of single-thread model of selftests bench reveals
> > performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> > ARM64. On my local testing machine, 5% of CPU time is consumed by
> > find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> > about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >
> > This patch introduce struct uprobe_breakpoint to track previously
> > allocated insn_slot for frequently hit uprobe. it effectively reduce the
> > need for redundant insn_slot writes and subsequent expensive cache
> > flush, especially on architecture like ARM64. This patch has been tested
> > on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> > bench and Redis GET/SET benchmark result below reveal obivious
> > performance gain.
> >
> > before-opt
> > ----------
> > trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> > trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> > trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)

I'm surprised that nop and push variants are much slower than ret
variant. This is exactly opposite on x86-64. Do you have an
explanation why this might be happening? I see you are trying to
optimize xol_get_insn_slot(), but that is (at least for x86) a slow
variant of uprobe that normally shouldn't be used. Typically uprobe is
installed on nop (for USDT) and on function entry (which would be push
variant, `push %rbp` instruction).

ret variant, for x86-64, causes one extra step to go back to user
space to execute original instruction out-of-line, and then trapping
back to kernel for running uprobe. Which is what you normally want to
avoid.

What I'm getting at here. It seems like maybe arm arch is missing fast
emulated implementations for nops/push or whatever equivalents for
ARM64 that is. Please take a look at that and see why those are slow
and whether you can make those into fast uprobe cases?

> > trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> > trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> > trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> > Redis SET (RPS) uprobe: 42728.52
> > Redis GET (RPS) uprobe: 43640.18
> > Redis SET (RPS) uretprobe: 40624.54
> > Redis GET (RPS) uretprobe: 41180.56
> >
> > after-opt
> > ---------
> > trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> > trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> > trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> > trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> > trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> > trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> > Redis SET (RPS) uprobe: 43939.69
> > Redis GET (RPS) uprobe: 45200.80
> > Redis SET (RPS) uretprobe: 41658.58
> > Redis GET (RPS) uretprobe: 42805.80
> >
> > While some uprobes might still need to share the same insn_slot, this
> > patch compare the instructions in the resued insn_slot with the
> > instructions execute out-of-line firstly to decides allocate a new one
> > or not.
> >
> > Additionally, this patch use a rbtree associated with each thread that
> > hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> > rbtree of uprobe_breakpoints has smaller node, better locality and less
> > contention, it result in faster lookup times compared to find_uprobe().
> >
> > The other part of this patch are some necessary memory management for
> > uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> > hit uprobe that doesn't already have a corresponding node in rbtree. All
> > uprobe_breakpoints will be freed when thread exit.
> >
> > Signed-off-by: Liao Chang <liaochang1@huawei.com>
> > ---
> >  include/linux/uprobes.h |   3 +
> >  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> >  2 files changed, 211 insertions(+), 38 deletions(-)
> >

[...]

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 5 months ago


在 2024/8/9 2:26, Andrii Nakryiko 写道:
> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>> within the mailing list. Considering this interest, I've added you and other relevant
>> maintainers to the CC list for broader visibility and potential collaboration.
>>
> 
> Hi Liao,
> 
> As you can see there is an active work to improve uprobes, that
> changes lifetime management of uprobes, removes a bunch of locks taken
> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> hold off a bit with your changes until all that lands. And then
> re-benchmark, as costs might shift.
> 
> But also see some remarks below.
> 
>> Thanks.
>>
>> 在 2024/7/27 17:44, Liao Chang 写道:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> ----------
>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> 
> I'm surprised that nop and push variants are much slower than ret
> variant. This is exactly opposite on x86-64. Do you have an
> explanation why this might be happening? I see you are trying to
> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> variant of uprobe that normally shouldn't be used. Typically uprobe is
> installed on nop (for USDT) and on function entry (which would be push
> variant, `push %rbp` instruction).
> 
> ret variant, for x86-64, causes one extra step to go back to user
> space to execute original instruction out-of-line, and then trapping
> back to kernel for running uprobe. Which is what you normally want to
> avoid.
> 
> What I'm getting at here. It seems like maybe arm arch is missing fast
> emulated implementations for nops/push or whatever equivalents for
> ARM64 that is. Please take a look at that and see why those are slow
> and whether you can make those into fast uprobe cases?

Hi Andrii,

As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
compared to X86 behavior. My investigation revealed that the root cause lies in
the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
to handle these instructions out-of-line in userspace upon breakpoint exception
is handled, leading to a significant performance overhead compared to 'ret' variant,
which is already emulated.

To address this issue, I've developed a patch supports  the emulation of 'nop' and
'push' variants. The benchmark results below indicates the performance gain of
emulation is obivious.

xol (1 cpus)
------------
uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)

As you can see, the performance gap between nop/push and ret variants has been significantly
reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
more cycles than the other.

> 
>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> ---------
>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot, this
>>> patch compare the instructions in the resued insn_slot with the
>>> instructions execute out-of-line firstly to decides allocate a new one
>>> or not.
>>>
>>> Additionally, this patch use a rbtree associated with each thread that
>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>> contention, it result in faster lookup times compared to find_uprobe().
>>>
>>> The other part of this patch are some necessary memory management for
>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>> uprobe_breakpoints will be freed when thread exit.
>>>
>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>> ---
>>>  include/linux/uprobes.h |   3 +
>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>
> 
> [...]

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 5 months ago

On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> > On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >> Hi Andrii and Oleg.
> >>
> >> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >> within the mailing list. Considering this interest, I've added you and other relevant
> >> maintainers to the CC list for broader visibility and potential collaboration.
> >>
> >
> > Hi Liao,
> >
> > As you can see there is an active work to improve uprobes, that
> > changes lifetime management of uprobes, removes a bunch of locks taken
> > in the uprobe/uretprobe hot path, etc. It would be nice if you can
> > hold off a bit with your changes until all that lands. And then
> > re-benchmark, as costs might shift.
> >
> > But also see some remarks below.
> >
> >> Thanks.
> >>
> >> 在 2024/7/27 17:44, Liao Chang 写道:
> >>> The profiling result of single-thread model of selftests bench reveals
> >>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> >>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> >>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> >>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >>>
> >>> This patch introduce struct uprobe_breakpoint to track previously
> >>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> >>> need for redundant insn_slot writes and subsequent expensive cache
> >>> flush, especially on architecture like ARM64. This patch has been tested
> >>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> >>> bench and Redis GET/SET benchmark result below reveal obivious
> >>> performance gain.
> >>>
> >>> before-opt
> >>> ----------
> >>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> >>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> >>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> >
> > I'm surprised that nop and push variants are much slower than ret
> > variant. This is exactly opposite on x86-64. Do you have an
> > explanation why this might be happening? I see you are trying to
> > optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> > variant of uprobe that normally shouldn't be used. Typically uprobe is
> > installed on nop (for USDT) and on function entry (which would be push
> > variant, `push %rbp` instruction).
> >
> > ret variant, for x86-64, causes one extra step to go back to user
> > space to execute original instruction out-of-line, and then trapping
> > back to kernel for running uprobe. Which is what you normally want to
> > avoid.
> >
> > What I'm getting at here. It seems like maybe arm arch is missing fast
> > emulated implementations for nops/push or whatever equivalents for
> > ARM64 that is. Please take a look at that and see why those are slow
> > and whether you can make those into fast uprobe cases?
>
> Hi Andrii,
>
> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
> compared to X86 behavior. My investigation revealed that the root cause lies in
> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
> to handle these instructions out-of-line in userspace upon breakpoint exception
> is handled, leading to a significant performance overhead compared to 'ret' variant,
> which is already emulated.
>
> To address this issue, I've developed a patch supports  the emulation of 'nop' and
> 'push' variants. The benchmark results below indicates the performance gain of
> emulation is obivious.
>
> xol (1 cpus)
> ------------
> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>
> emulation (1 cpus)
> -------------------
> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
>
> As you can see, the performance gap between nop/push and ret variants has been significantly
> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
> more cycles than the other.

Great, it's an obvious improvement. Are you going to send patches
upstream? Please cc bpf@vger.kernel.org as well.


I'm also thinking we should update uprobe/uretprobe benchmarks to be
less x86-specific. Right now "-nop" is the happy fastest case, "-push"
is still happy, slightly slower case (due to the need to emulate stack
operation) and "-ret" is meant to be the slow single-step case. We
should adjust the naming and make sure that on ARM64 we hit similar
code paths. Given you seem to know arm64 pretty well, can you please
take a look at updating bench tool for ARM64 (we can also rename
benchmarks to something a bit more generic, rather than using
instruction names)?

>
> >
> >>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> >>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> >>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> >>> Redis SET (RPS) uprobe: 42728.52
> >>> Redis GET (RPS) uprobe: 43640.18
> >>> Redis SET (RPS) uretprobe: 40624.54
> >>> Redis GET (RPS) uretprobe: 41180.56
> >>>
> >>> after-opt
> >>> ---------
> >>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>> Redis SET (RPS) uprobe: 43939.69
> >>> Redis GET (RPS) uprobe: 45200.80
> >>> Redis SET (RPS) uretprobe: 41658.58
> >>> Redis GET (RPS) uretprobe: 42805.80
> >>>
> >>> While some uprobes might still need to share the same insn_slot, this
> >>> patch compare the instructions in the resued insn_slot with the
> >>> instructions execute out-of-line firstly to decides allocate a new one
> >>> or not.
> >>>
> >>> Additionally, this patch use a rbtree associated with each thread that
> >>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> >>> rbtree of uprobe_breakpoints has smaller node, better locality and less
> >>> contention, it result in faster lookup times compared to find_uprobe().
> >>>
> >>> The other part of this patch are some necessary memory management for
> >>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> >>> hit uprobe that doesn't already have a corresponding node in rbtree. All
> >>> uprobe_breakpoints will be freed when thread exit.
> >>>
> >>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>> ---
> >>>  include/linux/uprobes.h |   3 +
> >>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> >>>  2 files changed, 211 insertions(+), 38 deletions(-)
> >>>
> >
> > [...]
>
> --
> BR
> Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 5 months ago


在 2024/8/13 1:49, Andrii Nakryiko 写道:
> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>
>>>> Hi Andrii and Oleg.
>>>>
>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>>>> within the mailing list. Considering this interest, I've added you and other relevant
>>>> maintainers to the CC list for broader visibility and potential collaboration.
>>>>
>>>
>>> Hi Liao,
>>>
>>> As you can see there is an active work to improve uprobes, that
>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>> hold off a bit with your changes until all that lands. And then
>>> re-benchmark, as costs might shift.
>>>
>>> But also see some remarks below.
>>>
>>>> Thanks.
>>>>
>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>
>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>> performance gain.
>>>>>
>>>>> before-opt
>>>>> ----------
>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>
>>> I'm surprised that nop and push variants are much slower than ret
>>> variant. This is exactly opposite on x86-64. Do you have an
>>> explanation why this might be happening? I see you are trying to
>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>> installed on nop (for USDT) and on function entry (which would be push
>>> variant, `push %rbp` instruction).
>>>
>>> ret variant, for x86-64, causes one extra step to go back to user
>>> space to execute original instruction out-of-line, and then trapping
>>> back to kernel for running uprobe. Which is what you normally want to
>>> avoid.
>>>
>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>> emulated implementations for nops/push or whatever equivalents for
>>> ARM64 that is. Please take a look at that and see why those are slow
>>> and whether you can make those into fast uprobe cases?
>>
>> Hi Andrii,
>>
>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
>> compared to X86 behavior. My investigation revealed that the root cause lies in
>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
>> to handle these instructions out-of-line in userspace upon breakpoint exception
>> is handled, leading to a significant performance overhead compared to 'ret' variant,
>> which is already emulated.
>>
>> To address this issue, I've developed a patch supports  the emulation of 'nop' and
>> 'push' variants. The benchmark results below indicates the performance gain of
>> emulation is obivious.
>>
>> xol (1 cpus)
>> ------------
>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>
>> emulation (1 cpus)
>> -------------------
>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
>>
>> As you can see, the performance gap between nop/push and ret variants has been significantly
>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
>> more cycles than the other.
> 
> Great, it's an obvious improvement. Are you going to send patches
> upstream? Please cc bpf@vger.kernel.org as well.

I'll need more time to thoroughly test this patch. The emulation o push/nop
instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
regression.

> 
> 
> I'm also thinking we should update uprobe/uretprobe benchmarks to be
> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
> is still happy, slightly slower case (due to the need to emulate stack
> operation) and "-ret" is meant to be the slow single-step case. We
> should adjust the naming and make sure that on ARM64 we hit similar
> code paths. Given you seem to know arm64 pretty well, can you please
> take a look at updating bench tool for ARM64 (we can also rename
> benchmarks to something a bit more generic, rather than using
> instruction names)?

Let me use a matrix below for the structured comparsion of uprobe/uretprobe
benchmarks on X86 and Arm64:

Architecture  Instrution Type   Handling method   Performance
X86           nop               Emulated          Fastest
X86           push              Emulated          Fast
X86           ret               Single-step       Slow
Arm64         nop               Emulated          Fastest
Arm64         push              Emulated          Fast
Arm64         ret               Emulated          Faster

I suggest categorize benchmarks into 'emu' for emulated instructions and 'ss'
for 'single-steppable' instructions. Generally, emulated instructions should
outperform single-step ones across different architectures. Regarding the
generic naming, I propose using a self-explanatory style, such as
s/nop/empty-insn/g, s/push/push-stack/g, s/ret/func-return/g.

Above all, example "bench --list" output:

X86:
  ...
  trig-uprobe-emu-empty-insn
  trig-uprobe-ss-func-return
  trig-uprobe-emu-push-stack
  trig-uretprobe-emu-empyt-insn
  trig-uretprobe-ss-func-return
  trig-uretprobe-emu-push-stack
  ...

Arm64:
  ...
  trig-uprobe-emu-empty-insn
  trig-uprobe-emu-func-return
  trig-uprobe-emu-push-stack
  trig-uretprobe-emu-empyt-insn
  trig-uretprobe-emu-func-return
  trig-uretprobe-emu-push-stack
  ...

This structure will allow for direct comparison of uprobe/uretprobe
performance across different architectures and instruction types.
Please let me know your thought, Andrii.

Thanks.

> 
>>
>>>
>>>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>>>> Redis SET (RPS) uprobe: 42728.52
>>>>> Redis GET (RPS) uprobe: 43640.18
>>>>> Redis SET (RPS) uretprobe: 40624.54
>>>>> Redis GET (RPS) uretprobe: 41180.56
>>>>>
>>>>> after-opt
>>>>> ---------
>>>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>> Redis SET (RPS) uprobe: 43939.69
>>>>> Redis GET (RPS) uprobe: 45200.80
>>>>> Redis SET (RPS) uretprobe: 41658.58
>>>>> Redis GET (RPS) uretprobe: 42805.80
>>>>>
>>>>> While some uprobes might still need to share the same insn_slot, this
>>>>> patch compare the instructions in the resued insn_slot with the
>>>>> instructions execute out-of-line firstly to decides allocate a new one
>>>>> or not.
>>>>>
>>>>> Additionally, this patch use a rbtree associated with each thread that
>>>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>>>> contention, it result in faster lookup times compared to find_uprobe().
>>>>>
>>>>> The other part of this patch are some necessary memory management for
>>>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>>>> uprobe_breakpoints will be freed when thread exit.
>>>>>
>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>> ---
>>>>>  include/linux/uprobes.h |   3 +
>>>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>>>
>>>
>>> [...]
>>
>> --
>> BR
>> Liao, Chang

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 5 months ago

On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
> > On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >>
> >>
> >> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> >>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>>
> >>>> Hi Andrii and Oleg.
> >>>>
> >>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >>>> within the mailing list. Considering this interest, I've added you and other relevant
> >>>> maintainers to the CC list for broader visibility and potential collaboration.
> >>>>
> >>>
> >>> Hi Liao,
> >>>
> >>> As you can see there is an active work to improve uprobes, that
> >>> changes lifetime management of uprobes, removes a bunch of locks taken
> >>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> >>> hold off a bit with your changes until all that lands. And then
> >>> re-benchmark, as costs might shift.
> >>>
> >>> But also see some remarks below.
> >>>
> >>>> Thanks.
> >>>>
> >>>> 在 2024/7/27 17:44, Liao Chang 写道:
> >>>>> The profiling result of single-thread model of selftests bench reveals
> >>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> >>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> >>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> >>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >>>>>
> >>>>> This patch introduce struct uprobe_breakpoint to track previously
> >>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> >>>>> need for redundant insn_slot writes and subsequent expensive cache
> >>>>> flush, especially on architecture like ARM64. This patch has been tested
> >>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> >>>>> bench and Redis GET/SET benchmark result below reveal obivious
> >>>>> performance gain.
> >>>>>
> >>>>> before-opt
> >>>>> ----------
> >>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> >>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> >>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> >>>
> >>> I'm surprised that nop and push variants are much slower than ret
> >>> variant. This is exactly opposite on x86-64. Do you have an
> >>> explanation why this might be happening? I see you are trying to
> >>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> >>> variant of uprobe that normally shouldn't be used. Typically uprobe is
> >>> installed on nop (for USDT) and on function entry (which would be push
> >>> variant, `push %rbp` instruction).
> >>>
> >>> ret variant, for x86-64, causes one extra step to go back to user
> >>> space to execute original instruction out-of-line, and then trapping
> >>> back to kernel for running uprobe. Which is what you normally want to
> >>> avoid.
> >>>
> >>> What I'm getting at here. It seems like maybe arm arch is missing fast
> >>> emulated implementations for nops/push or whatever equivalents for
> >>> ARM64 that is. Please take a look at that and see why those are slow
> >>> and whether you can make those into fast uprobe cases?
> >>
> >> Hi Andrii,
> >>
> >> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
> >> compared to X86 behavior. My investigation revealed that the root cause lies in
> >> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
> >> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
> >> to handle these instructions out-of-line in userspace upon breakpoint exception
> >> is handled, leading to a significant performance overhead compared to 'ret' variant,
> >> which is already emulated.
> >>
> >> To address this issue, I've developed a patch supports  the emulation of 'nop' and
> >> 'push' variants. The benchmark results below indicates the performance gain of
> >> emulation is obivious.
> >>
> >> xol (1 cpus)
> >> ------------
> >> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>
> >> emulation (1 cpus)
> >> -------------------
> >> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
> >> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
> >> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
> >> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
> >> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
> >> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
> >>
> >> As you can see, the performance gap between nop/push and ret variants has been significantly
> >> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
> >> more cycles than the other.
> >
> > Great, it's an obvious improvement. Are you going to send patches
> > upstream? Please cc bpf@vger.kernel.org as well.
>
> I'll need more time to thoroughly test this patch. The emulation o push/nop
> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
> regression.

Why would the *benchmarks* have to be modified? The typical
kprobe/kretprobe attachment should be fast, and those benchmarks
simulate typical fast path kprobe/kretprobe. Is there some simulation
logic that is shared between uprobes and kprobes or something?

>
> >
> >
> > I'm also thinking we should update uprobe/uretprobe benchmarks to be
> > less x86-specific. Right now "-nop" is the happy fastest case, "-push"
> > is still happy, slightly slower case (due to the need to emulate stack
> > operation) and "-ret" is meant to be the slow single-step case. We
> > should adjust the naming and make sure that on ARM64 we hit similar
> > code paths. Given you seem to know arm64 pretty well, can you please
> > take a look at updating bench tool for ARM64 (we can also rename
> > benchmarks to something a bit more generic, rather than using
> > instruction names)?
>

[...]

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 5 months ago


在 2024/8/15 2:42, Andrii Nakryiko 写道:
> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>>>
>>>>>> Hi Andrii and Oleg.
>>>>>>
>>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>>>>>> within the mailing list. Considering this interest, I've added you and other relevant
>>>>>> maintainers to the CC list for broader visibility and potential collaboration.
>>>>>>
>>>>>
>>>>> Hi Liao,
>>>>>
>>>>> As you can see there is an active work to improve uprobes, that
>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>> hold off a bit with your changes until all that lands. And then
>>>>> re-benchmark, as costs might shift.
>>>>>
>>>>> But also see some remarks below.
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>>
>>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>>> performance gain.
>>>>>>>
>>>>>>> before-opt
>>>>>>> ----------
>>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>>
>>>>> I'm surprised that nop and push variants are much slower than ret
>>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>>> explanation why this might be happening? I see you are trying to
>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>>> installed on nop (for USDT) and on function entry (which would be push
>>>>> variant, `push %rbp` instruction).
>>>>>
>>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>>> space to execute original instruction out-of-line, and then trapping
>>>>> back to kernel for running uprobe. Which is what you normally want to
>>>>> avoid.
>>>>>
>>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>>>> emulated implementations for nops/push or whatever equivalents for
>>>>> ARM64 that is. Please take a look at that and see why those are slow
>>>>> and whether you can make those into fast uprobe cases?
>>>>
>>>> Hi Andrii,
>>>>
>>>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
>>>> compared to X86 behavior. My investigation revealed that the root cause lies in
>>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
>>>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
>>>> to handle these instructions out-of-line in userspace upon breakpoint exception
>>>> is handled, leading to a significant performance overhead compared to 'ret' variant,
>>>> which is already emulated.
>>>>
>>>> To address this issue, I've developed a patch supports  the emulation of 'nop' and
>>>> 'push' variants. The benchmark results below indicates the performance gain of
>>>> emulation is obivious.
>>>>
>>>> xol (1 cpus)
>>>> ------------
>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>
>>>> emulation (1 cpus)
>>>> -------------------
>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
>>>>
>>>> As you can see, the performance gap between nop/push and ret variants has been significantly
>>>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
>>>> more cycles than the other.
>>>
>>> Great, it's an obvious improvement. Are you going to send patches
>>> upstream? Please cc bpf@vger.kernel.org as well.
>>
>> I'll need more time to thoroughly test this patch. The emulation o push/nop
>> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
>> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
>> regression.
> 
> Why would the *benchmarks* have to be modified? The typical
> kprobe/kretprobe attachment should be fast, and those benchmarks
> simulate typical fast path kprobe/kretprobe. Is there some simulation
> logic that is shared between uprobes and kprobes or something?

Yes, kprobe and uprobe share many things for Arm64, but there are curical
difference. Let me explain further. Simulating a 'push' instruction on
arm64 will modify the stack pointer at *probe breakpoint. However, kprobe
and uprobe use different way to restore the stack pointer upon returning
from the breakpoint exception. Consequently.sharing the same simulation
logic for both would result in kernel panic for kprobe.

To avoid complicating the exception return logic, I've opted to simuate
'push' only for uprobe and maintain the single-stepping for kprobe [0].
This trade-off avoid the impacts to kprobe/kretprobe, and no need to
change the kprobe/kretprobe related benchmark.

[0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@huawei.com/

> 
>>
>>>
>>>
>>> I'm also thinking we should update uprobe/uretprobe benchmarks to be
>>> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
>>> is still happy, slightly slower case (due to the need to emulate stack
>>> operation) and "-ret" is meant to be the slow single-step case. We
>>> should adjust the naming and make sure that on ARM64 we hit similar
>>> code paths. Given you seem to know arm64 pretty well, can you please
>>> take a look at updating bench tool for ARM64 (we can also rename
>>> benchmarks to something a bit more generic, rather than using
>>> instruction names)?
>>
> 
> [...]

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 5 months ago

On Wed, Aug 14, 2024 at 7:58 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/15 2:42, Andrii Nakryiko 写道:
> > On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >>
> >>
> >> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
> >>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> >>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>>>>
> >>>>>> Hi Andrii and Oleg.
> >>>>>>
> >>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >>>>>> within the mailing list. Considering this interest, I've added you and other relevant
> >>>>>> maintainers to the CC list for broader visibility and potential collaboration.
> >>>>>>
> >>>>>
> >>>>> Hi Liao,
> >>>>>
> >>>>> As you can see there is an active work to improve uprobes, that
> >>>>> changes lifetime management of uprobes, removes a bunch of locks taken
> >>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> >>>>> hold off a bit with your changes until all that lands. And then
> >>>>> re-benchmark, as costs might shift.
> >>>>>
> >>>>> But also see some remarks below.
> >>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
> >>>>>>> The profiling result of single-thread model of selftests bench reveals
> >>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> >>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> >>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> >>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >>>>>>>
> >>>>>>> This patch introduce struct uprobe_breakpoint to track previously
> >>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> >>>>>>> need for redundant insn_slot writes and subsequent expensive cache
> >>>>>>> flush, especially on architecture like ARM64. This patch has been tested
> >>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> >>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
> >>>>>>> performance gain.
> >>>>>>>
> >>>>>>> before-opt
> >>>>>>> ----------
> >>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> >>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> >>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> >>>>>
> >>>>> I'm surprised that nop and push variants are much slower than ret
> >>>>> variant. This is exactly opposite on x86-64. Do you have an
> >>>>> explanation why this might be happening? I see you are trying to
> >>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> >>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
> >>>>> installed on nop (for USDT) and on function entry (which would be push
> >>>>> variant, `push %rbp` instruction).
> >>>>>
> >>>>> ret variant, for x86-64, causes one extra step to go back to user
> >>>>> space to execute original instruction out-of-line, and then trapping
> >>>>> back to kernel for running uprobe. Which is what you normally want to
> >>>>> avoid.
> >>>>>
> >>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
> >>>>> emulated implementations for nops/push or whatever equivalents for
> >>>>> ARM64 that is. Please take a look at that and see why those are slow
> >>>>> and whether you can make those into fast uprobe cases?
> >>>>
> >>>> Hi Andrii,
> >>>>
> >>>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
> >>>> compared to X86 behavior. My investigation revealed that the root cause lies in
> >>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
> >>>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
> >>>> to handle these instructions out-of-line in userspace upon breakpoint exception
> >>>> is handled, leading to a significant performance overhead compared to 'ret' variant,
> >>>> which is already emulated.
> >>>>
> >>>> To address this issue, I've developed a patch supports  the emulation of 'nop' and
> >>>> 'push' variants. The benchmark results below indicates the performance gain of
> >>>> emulation is obivious.
> >>>>
> >>>> xol (1 cpus)
> >>>> ------------
> >>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>>>
> >>>> emulation (1 cpus)
> >>>> -------------------
> >>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
> >>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
> >>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
> >>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
> >>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
> >>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
> >>>>
> >>>> As you can see, the performance gap between nop/push and ret variants has been significantly
> >>>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
> >>>> more cycles than the other.
> >>>
> >>> Great, it's an obvious improvement. Are you going to send patches
> >>> upstream? Please cc bpf@vger.kernel.org as well.
> >>
> >> I'll need more time to thoroughly test this patch. The emulation o push/nop
> >> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
> >> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
> >> regression.
> >
> > Why would the *benchmarks* have to be modified? The typical
> > kprobe/kretprobe attachment should be fast, and those benchmarks
> > simulate typical fast path kprobe/kretprobe. Is there some simulation
> > logic that is shared between uprobes and kprobes or something?
>
> Yes, kprobe and uprobe share many things for Arm64, but there are curical
> difference. Let me explain further. Simulating a 'push' instruction on
> arm64 will modify the stack pointer at *probe breakpoint. However, kprobe
> and uprobe use different way to restore the stack pointer upon returning
> from the breakpoint exception. Consequently.sharing the same simulation
> logic for both would result in kernel panic for kprobe.
>
> To avoid complicating the exception return logic, I've opted to simuate
> 'push' only for uprobe and maintain the single-stepping for kprobe [0].
> This trade-off avoid the impacts to kprobe/kretprobe, and no need to
> change the kprobe/kretprobe related benchmark.
>

I see, thanks for explaining. I noticed the "bool kernel" flag you
added, it makes sense.

I still don't understand why you'd need to modify kprobe/kretprobe
benchmarks, as they are testing attaching kprobe at the kernel
function entry, which for kernels should be an optimized case not
requiring any emulation.

> [0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@huawei.com/
>
> >
> >>
> >>>
> >>>
> >>> I'm also thinking we should update uprobe/uretprobe benchmarks to be
> >>> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
> >>> is still happy, slightly slower case (due to the need to emulate stack
> >>> operation) and "-ret" is meant to be the slow single-step case. We
> >>> should adjust the naming and make sure that on ARM64 we hit similar
> >>> code paths. Given you seem to know arm64 pretty well, can you please
> >>> take a look at updating bench tool for ARM64 (we can also rename
> >>> benchmarks to something a bit more generic, rather than using
> >>> instruction names)?
> >>
> >
> > [...]
>
> --
> BR
> Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 5 months ago


在 2024/8/16 0:53, Andrii Nakryiko 写道:
> On Wed, Aug 14, 2024 at 7:58 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/8/15 2:42, Andrii Nakryiko 写道:
>>> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
>>>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>>>>>
>>>>>>>> Hi Andrii and Oleg.
>>>>>>>>
>>>>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>>>>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>>>>>>>> within the mailing list. Considering this interest, I've added you and other relevant
>>>>>>>> maintainers to the CC list for broader visibility and potential collaboration.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Liao,
>>>>>>>
>>>>>>> As you can see there is an active work to improve uprobes, that
>>>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>>>> hold off a bit with your changes until all that lands. And then
>>>>>>> re-benchmark, as costs might shift.
>>>>>>>
>>>>>>> But also see some remarks below.
>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>>>>
>>>>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>>>>> performance gain.
>>>>>>>>>
>>>>>>>>> before-opt
>>>>>>>>> ----------
>>>>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>>>>
>>>>>>> I'm surprised that nop and push variants are much slower than ret
>>>>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>>>>> explanation why this might be happening? I see you are trying to
>>>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>>>>> installed on nop (for USDT) and on function entry (which would be push
>>>>>>> variant, `push %rbp` instruction).
>>>>>>>
>>>>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>>>>> space to execute original instruction out-of-line, and then trapping
>>>>>>> back to kernel for running uprobe. Which is what you normally want to
>>>>>>> avoid.
>>>>>>>
>>>>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>>>>>> emulated implementations for nops/push or whatever equivalents for
>>>>>>> ARM64 that is. Please take a look at that and see why those are slow
>>>>>>> and whether you can make those into fast uprobe cases?
>>>>>>
>>>>>> Hi Andrii,
>>>>>>
>>>>>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
>>>>>> compared to X86 behavior. My investigation revealed that the root cause lies in
>>>>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
>>>>>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
>>>>>> to handle these instructions out-of-line in userspace upon breakpoint exception
>>>>>> is handled, leading to a significant performance overhead compared to 'ret' variant,
>>>>>> which is already emulated.
>>>>>>
>>>>>> To address this issue, I've developed a patch supports  the emulation of 'nop' and
>>>>>> 'push' variants. The benchmark results below indicates the performance gain of
>>>>>> emulation is obivious.
>>>>>>
>>>>>> xol (1 cpus)
>>>>>> ------------
>>>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>>
>>>>>> emulation (1 cpus)
>>>>>> -------------------
>>>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
>>>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
>>>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
>>>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
>>>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
>>>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
>>>>>>
>>>>>> As you can see, the performance gap between nop/push and ret variants has been significantly
>>>>>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
>>>>>> more cycles than the other.
>>>>>
>>>>> Great, it's an obvious improvement. Are you going to send patches
>>>>> upstream? Please cc bpf@vger.kernel.org as well.
>>>>
>>>> I'll need more time to thoroughly test this patch. The emulation o push/nop
>>>> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
>>>> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
>>>> regression.
>>>
>>> Why would the *benchmarks* have to be modified? The typical
>>> kprobe/kretprobe attachment should be fast, and those benchmarks
>>> simulate typical fast path kprobe/kretprobe. Is there some simulation
>>> logic that is shared between uprobes and kprobes or something?
>>
>> Yes, kprobe and uprobe share many things for Arm64, but there are curical
>> difference. Let me explain further. Simulating a 'push' instruction on
>> arm64 will modify the stack pointer at *probe breakpoint. However, kprobe
>> and uprobe use different way to restore the stack pointer upon returning
>> from the breakpoint exception. Consequently.sharing the same simulation
>> logic for both would result in kernel panic for kprobe.
>>
>> To avoid complicating the exception return logic, I've opted to simuate
>> 'push' only for uprobe and maintain the single-stepping for kprobe [0].
>> This trade-off avoid the impacts to kprobe/kretprobe, and no need to
>> change the kprobe/kretprobe related benchmark.
>>
> 
> I see, thanks for explaining. I noticed the "bool kernel" flag you
> added, it makes sense.
> 
> I still don't understand why you'd need to modify kprobe/kretprobe
> benchmarks, as they are testing attaching kprobe at the kernel
> function entry, which for kernels should be an optimized case not
> requiring any emulation.

Sorry about the confusion. I've revised the implementation of nop/push
emulation to avoid the impacts the kprobe/kretprobe on Arm64, see the
change to arm_probe_decode_insn() in the patch [0]. As a result, no
changes to the kprobe/kretprobe benchmarks.

[0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@huawei.com/

Thanks.

> 
>> [0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@huawei.com/
>>
>>>
>>>>
>>>>>
>>>>>
>>>>> I'm also thinking we should update uprobe/uretprobe benchmarks to be
>>>>> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
>>>>> is still happy, slightly slower case (due to the need to emulate stack
>>>>> operation) and "-ret" is meant to be the slow single-step case. We
>>>>> should adjust the naming and make sure that on ARM64 we hit similar
>>>>> code paths. Given you seem to know arm64 pretty well, can you please
>>>>> take a look at updating bench tool for ARM64 (we can also rename
>>>>> benchmarks to something a bit more generic, rather than using
>>>>> instruction names)?
>>>>
>>>
>>> [...]
>>
>> --
>> BR
>> Liao, Chang
> 
> 

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 5 months ago

On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
> > On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >>
> >>
> >> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> >>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>>
> >>>> Hi Andrii and Oleg.
> >>>>
> >>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >>>> within the mailing list. Considering this interest, I've added you and other relevant
> >>>> maintainers to the CC list for broader visibility and potential collaboration.
> >>>>
> >>>
> >>> Hi Liao,
> >>>
> >>> As you can see there is an active work to improve uprobes, that
> >>> changes lifetime management of uprobes, removes a bunch of locks taken
> >>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> >>> hold off a bit with your changes until all that lands. And then
> >>> re-benchmark, as costs might shift.
> >>>
> >>> But also see some remarks below.
> >>>
> >>>> Thanks.
> >>>>
> >>>> 在 2024/7/27 17:44, Liao Chang 写道:
> >>>>> The profiling result of single-thread model of selftests bench reveals
> >>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> >>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> >>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> >>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >>>>>
> >>>>> This patch introduce struct uprobe_breakpoint to track previously
> >>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> >>>>> need for redundant insn_slot writes and subsequent expensive cache
> >>>>> flush, especially on architecture like ARM64. This patch has been tested
> >>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> >>>>> bench and Redis GET/SET benchmark result below reveal obivious
> >>>>> performance gain.
> >>>>>
> >>>>> before-opt
> >>>>> ----------
> >>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> >>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> >>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> >>>
> >>> I'm surprised that nop and push variants are much slower than ret
> >>> variant. This is exactly opposite on x86-64. Do you have an
> >>> explanation why this might be happening? I see you are trying to
> >>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> >>> variant of uprobe that normally shouldn't be used. Typically uprobe is
> >>> installed on nop (for USDT) and on function entry (which would be push
> >>> variant, `push %rbp` instruction).
> >>>
> >>> ret variant, for x86-64, causes one extra step to go back to user
> >>> space to execute original instruction out-of-line, and then trapping
> >>> back to kernel for running uprobe. Which is what you normally want to
> >>> avoid.
> >>>
> >>> What I'm getting at here. It seems like maybe arm arch is missing fast
> >>> emulated implementations for nops/push or whatever equivalents for
> >>> ARM64 that is. Please take a look at that and see why those are slow
> >>> and whether you can make those into fast uprobe cases?
> >>
> >> Hi Andrii,
> >>
> >> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
> >> compared to X86 behavior. My investigation revealed that the root cause lies in
> >> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
> >> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
> >> to handle these instructions out-of-line in userspace upon breakpoint exception
> >> is handled, leading to a significant performance overhead compared to 'ret' variant,
> >> which is already emulated.
> >>
> >> To address this issue, I've developed a patch supports  the emulation of 'nop' and
> >> 'push' variants. The benchmark results below indicates the performance gain of
> >> emulation is obivious.
> >>
> >> xol (1 cpus)
> >> ------------
> >> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>
> >> emulation (1 cpus)
> >> -------------------
> >> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
> >> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
> >> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
> >> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
> >> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
> >> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
> >>
> >> As you can see, the performance gap between nop/push and ret variants has been significantly
> >> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
> >> more cycles than the other.
> >
> > Great, it's an obvious improvement. Are you going to send patches
> > upstream? Please cc bpf@vger.kernel.org as well.
>
> I'll need more time to thoroughly test this patch. The emulation o push/nop
> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
> regression.
>
> >
> >
> > I'm also thinking we should update uprobe/uretprobe benchmarks to be
> > less x86-specific. Right now "-nop" is the happy fastest case, "-push"
> > is still happy, slightly slower case (due to the need to emulate stack
> > operation) and "-ret" is meant to be the slow single-step case. We
> > should adjust the naming and make sure that on ARM64 we hit similar
> > code paths. Given you seem to know arm64 pretty well, can you please
> > take a look at updating bench tool for ARM64 (we can also rename
> > benchmarks to something a bit more generic, rather than using
> > instruction names)?
>
> Let me use a matrix below for the structured comparsion of uprobe/uretprobe
> benchmarks on X86 and Arm64:
>
> Architecture  Instrution Type   Handling method   Performance
> X86           nop               Emulated          Fastest
> X86           push              Emulated          Fast
> X86           ret               Single-step       Slow
> Arm64         nop               Emulated          Fastest
> Arm64         push              Emulated          Fast
> Arm64         ret               Emulated          Faster
>
> I suggest categorize benchmarks into 'emu' for emulated instructions and 'ss'
> for 'single-steppable' instructions. Generally, emulated instructions should
> outperform single-step ones across different architectures. Regarding the
> generic naming, I propose using a self-explanatory style, such as
> s/nop/empty-insn/g, s/push/push-stack/g, s/ret/func-return/g.
>
> Above all, example "bench --list" output:
>
> X86:
>   ...
>   trig-uprobe-emu-empty-insn
>   trig-uprobe-ss-func-return
>   trig-uprobe-emu-push-stack
>   trig-uretprobe-emu-empyt-insn
>   trig-uretprobe-ss-func-return
>   trig-uretprobe-emu-push-stack
>   ...
>
> Arm64:
>   ...
>   trig-uprobe-emu-empty-insn
>   trig-uprobe-emu-func-return
>   trig-uprobe-emu-push-stack
>   trig-uretprobe-emu-empyt-insn
>   trig-uretprobe-emu-func-return
>   trig-uretprobe-emu-push-stack
>   ...
>
> This structure will allow for direct comparison of uprobe/uretprobe
> performance across different architectures and instruction types.
> Please let me know your thought, Andrii.

Tbh, sounds a bit like an overkill. But before we decide on naming,
what kind of situation is single-stepped on arm64?

>
> Thanks.
>
> >
> >>
> >>>
> >>>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> >>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> >>>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> >>>>> Redis SET (RPS) uprobe: 42728.52
> >>>>> Redis GET (RPS) uprobe: 43640.18
> >>>>> Redis SET (RPS) uretprobe: 40624.54
> >>>>> Redis GET (RPS) uretprobe: 41180.56
> >>>>>
> >>>>> after-opt
> >>>>> ---------
> >>>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>>>> Redis SET (RPS) uprobe: 43939.69
> >>>>> Redis GET (RPS) uprobe: 45200.80
> >>>>> Redis SET (RPS) uretprobe: 41658.58
> >>>>> Redis GET (RPS) uretprobe: 42805.80
> >>>>>
> >>>>> While some uprobes might still need to share the same insn_slot, this
> >>>>> patch compare the instructions in the resued insn_slot with the
> >>>>> instructions execute out-of-line firstly to decides allocate a new one
> >>>>> or not.
> >>>>>
> >>>>> Additionally, this patch use a rbtree associated with each thread that
> >>>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> >>>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
> >>>>> contention, it result in faster lookup times compared to find_uprobe().
> >>>>>
> >>>>> The other part of this patch are some necessary memory management for
> >>>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> >>>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
> >>>>> uprobe_breakpoints will be freed when thread exit.
> >>>>>
> >>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>>>> ---
> >>>>>  include/linux/uprobes.h |   3 +
> >>>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> >>>>>  2 files changed, 211 insertions(+), 38 deletions(-)
> >>>>>
> >>>
> >>> [...]
> >>
> >> --
> >> BR
> >> Liao, Chang
>
> --
> BR
> Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 5 months ago


在 2024/8/15 0:57, Andrii Nakryiko 写道:
> On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
>>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>>>
>>>>>> Hi Andrii and Oleg.
>>>>>>
>>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>>>>>> within the mailing list. Considering this interest, I've added you and other relevant
>>>>>> maintainers to the CC list for broader visibility and potential collaboration.
>>>>>>
>>>>>
>>>>> Hi Liao,
>>>>>
>>>>> As you can see there is an active work to improve uprobes, that
>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>> hold off a bit with your changes until all that lands. And then
>>>>> re-benchmark, as costs might shift.
>>>>>
>>>>> But also see some remarks below.
>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>>
>>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>>> performance gain.
>>>>>>>
>>>>>>> before-opt
>>>>>>> ----------
>>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>>
>>>>> I'm surprised that nop and push variants are much slower than ret
>>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>>> explanation why this might be happening? I see you are trying to
>>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>>> installed on nop (for USDT) and on function entry (which would be push
>>>>> variant, `push %rbp` instruction).
>>>>>
>>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>>> space to execute original instruction out-of-line, and then trapping
>>>>> back to kernel for running uprobe. Which is what you normally want to
>>>>> avoid.
>>>>>
>>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>>>> emulated implementations for nops/push or whatever equivalents for
>>>>> ARM64 that is. Please take a look at that and see why those are slow
>>>>> and whether you can make those into fast uprobe cases?
>>>>
>>>> Hi Andrii,
>>>>
>>>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
>>>> compared to X86 behavior. My investigation revealed that the root cause lies in
>>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
>>>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
>>>> to handle these instructions out-of-line in userspace upon breakpoint exception
>>>> is handled, leading to a significant performance overhead compared to 'ret' variant,
>>>> which is already emulated.
>>>>
>>>> To address this issue, I've developed a patch supports  the emulation of 'nop' and
>>>> 'push' variants. The benchmark results below indicates the performance gain of
>>>> emulation is obivious.
>>>>
>>>> xol (1 cpus)
>>>> ------------
>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>
>>>> emulation (1 cpus)
>>>> -------------------
>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
>>>>
>>>> As you can see, the performance gap between nop/push and ret variants has been significantly
>>>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
>>>> more cycles than the other.
>>>
>>> Great, it's an obvious improvement. Are you going to send patches
>>> upstream? Please cc bpf@vger.kernel.org as well.
>>
>> I'll need more time to thoroughly test this patch. The emulation o push/nop
>> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
>> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
>> regression.
>>
>>>
>>>
>>> I'm also thinking we should update uprobe/uretprobe benchmarks to be
>>> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
>>> is still happy, slightly slower case (due to the need to emulate stack
>>> operation) and "-ret" is meant to be the slow single-step case. We
>>> should adjust the naming and make sure that on ARM64 we hit similar
>>> code paths. Given you seem to know arm64 pretty well, can you please
>>> take a look at updating bench tool for ARM64 (we can also rename
>>> benchmarks to something a bit more generic, rather than using
>>> instruction names)?
>>
>> Let me use a matrix below for the structured comparsion of uprobe/uretprobe
>> benchmarks on X86 and Arm64:
>>
>> Architecture  Instrution Type   Handling method   Performance
>> X86           nop               Emulated          Fastest
>> X86           push              Emulated          Fast
>> X86           ret               Single-step       Slow
>> Arm64         nop               Emulated          Fastest
>> Arm64         push              Emulated          Fast
>> Arm64         ret               Emulated          Faster
>>
>> I suggest categorize benchmarks into 'emu' for emulated instructions and 'ss'
>> for 'single-steppable' instructions. Generally, emulated instructions should
>> outperform single-step ones across different architectures. Regarding the
>> generic naming, I propose using a self-explanatory style, such as
>> s/nop/empty-insn/g, s/push/push-stack/g, s/ret/func-return/g.
>>
>> Above all, example "bench --list" output:
>>
>> X86:
>>   ...
>>   trig-uprobe-emu-empty-insn
>>   trig-uprobe-ss-func-return
>>   trig-uprobe-emu-push-stack
>>   trig-uretprobe-emu-empyt-insn
>>   trig-uretprobe-ss-func-return
>>   trig-uretprobe-emu-push-stack
>>   ...
>>
>> Arm64:
>>   ...
>>   trig-uprobe-emu-empty-insn
>>   trig-uprobe-emu-func-return
>>   trig-uprobe-emu-push-stack
>>   trig-uretprobe-emu-empyt-insn
>>   trig-uretprobe-emu-func-return
>>   trig-uretprobe-emu-push-stack
>>   ...
>>
>> This structure will allow for direct comparison of uprobe/uretprobe
>> performance across different architectures and instruction types.
>> Please let me know your thought, Andrii.
> 
> Tbh, sounds a bit like an overkill. But before we decide on naming,
> what kind of situation is single-stepped on arm64?

On Arm64, the following instruction types are generally not single-steppable.

  - Modifying and reading PC, including 'ret' and various branch instructions.

  - Forming a PC-relative address using the PC and an immediate value.

  - Generating exception, includes BRK, HLT, HVC, SMC, SVC.

  - Loading memory at address calculated based on the PC and an immediate offset.

  - Moving general-purpose register to system registers of PE (similar to logical cores on x86).

  - Hint instruction cause exception or unintend behavior for single-stepping.
    However, 'nop' is steppable hint.

Most parts of instructions that doesn't fall into any of these types are single-stepped,
including the Arm64 equvialents of 'push'.

Thanks.

> 
>>
>> Thanks.
>>
>>>
>>>>
>>>>>
>>>>>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>>>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>>>>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>>>>>> Redis SET (RPS) uprobe: 42728.52
>>>>>>> Redis GET (RPS) uprobe: 43640.18
>>>>>>> Redis SET (RPS) uretprobe: 40624.54
>>>>>>> Redis GET (RPS) uretprobe: 41180.56
>>>>>>>
>>>>>>> after-opt
>>>>>>> ---------
>>>>>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>>> Redis SET (RPS) uprobe: 43939.69
>>>>>>> Redis GET (RPS) uprobe: 45200.80
>>>>>>> Redis SET (RPS) uretprobe: 41658.58
>>>>>>> Redis GET (RPS) uretprobe: 42805.80
>>>>>>>
>>>>>>> While some uprobes might still need to share the same insn_slot, this
>>>>>>> patch compare the instructions in the resued insn_slot with the
>>>>>>> instructions execute out-of-line firstly to decides allocate a new one
>>>>>>> or not.
>>>>>>>
>>>>>>> Additionally, this patch use a rbtree associated with each thread that
>>>>>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>>>>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>>>>>> contention, it result in faster lookup times compared to find_uprobe().
>>>>>>>
>>>>>>> The other part of this patch are some necessary memory management for
>>>>>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>>>>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>>>>>> uprobe_breakpoints will be freed when thread exit.
>>>>>>>
>>>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>>>> ---
>>>>>>>  include/linux/uprobes.h |   3 +
>>>>>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>>>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>>>>>
>>>>>
>>>>> [...]
>>>>
>>>> --
>>>> BR
>>>> Liao, Chang
>>
>> --
>> BR
>> Liao, Chang

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 5 months ago

On Thu, Aug 15, 2024 at 12:59 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/15 0:57, Andrii Nakryiko 写道:
> > On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >>
> >>
> >> 在 2024/8/13 1:49, Andrii Nakryiko 写道:
> >>> On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> >>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>>>>
> >>>>>> Hi Andrii and Oleg.
> >>>>>>
> >>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >>>>>> within the mailing list. Considering this interest, I've added you and other relevant
> >>>>>> maintainers to the CC list for broader visibility and potential collaboration.
> >>>>>>
> >>>>>
> >>>>> Hi Liao,
> >>>>>
> >>>>> As you can see there is an active work to improve uprobes, that
> >>>>> changes lifetime management of uprobes, removes a bunch of locks taken
> >>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> >>>>> hold off a bit with your changes until all that lands. And then
> >>>>> re-benchmark, as costs might shift.
> >>>>>
> >>>>> But also see some remarks below.
> >>>>>
> >>>>>> Thanks.
> >>>>>>
> >>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
> >>>>>>> The profiling result of single-thread model of selftests bench reveals
> >>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> >>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> >>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> >>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >>>>>>>
> >>>>>>> This patch introduce struct uprobe_breakpoint to track previously
> >>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> >>>>>>> need for redundant insn_slot writes and subsequent expensive cache
> >>>>>>> flush, especially on architecture like ARM64. This patch has been tested
> >>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> >>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
> >>>>>>> performance gain.
> >>>>>>>
> >>>>>>> before-opt
> >>>>>>> ----------
> >>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> >>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> >>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> >>>>>
> >>>>> I'm surprised that nop and push variants are much slower than ret
> >>>>> variant. This is exactly opposite on x86-64. Do you have an
> >>>>> explanation why this might be happening? I see you are trying to
> >>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> >>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
> >>>>> installed on nop (for USDT) and on function entry (which would be push
> >>>>> variant, `push %rbp` instruction).
> >>>>>
> >>>>> ret variant, for x86-64, causes one extra step to go back to user
> >>>>> space to execute original instruction out-of-line, and then trapping
> >>>>> back to kernel for running uprobe. Which is what you normally want to
> >>>>> avoid.
> >>>>>
> >>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
> >>>>> emulated implementations for nops/push or whatever equivalents for
> >>>>> ARM64 that is. Please take a look at that and see why those are slow
> >>>>> and whether you can make those into fast uprobe cases?
> >>>>
> >>>> Hi Andrii,
> >>>>
> >>>> As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
> >>>> compared to X86 behavior. My investigation revealed that the root cause lies in
> >>>> the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
> >>>> of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
> >>>> to handle these instructions out-of-line in userspace upon breakpoint exception
> >>>> is handled, leading to a significant performance overhead compared to 'ret' variant,
> >>>> which is already emulated.
> >>>>
> >>>> To address this issue, I've developed a patch supports  the emulation of 'nop' and
> >>>> 'push' variants. The benchmark results below indicates the performance gain of
> >>>> emulation is obivious.
> >>>>
> >>>> xol (1 cpus)
> >>>> ------------
> >>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>>>
> >>>> emulation (1 cpus)
> >>>> -------------------
> >>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
> >>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
> >>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
> >>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
> >>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
> >>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)
> >>>>
> >>>> As you can see, the performance gap between nop/push and ret variants has been significantly
> >>>> reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
> >>>> more cycles than the other.
> >>>
> >>> Great, it's an obvious improvement. Are you going to send patches
> >>> upstream? Please cc bpf@vger.kernel.org as well.
> >>
> >> I'll need more time to thoroughly test this patch. The emulation o push/nop
> >> instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
> >> I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
> >> regression.
> >>
> >>>
> >>>
> >>> I'm also thinking we should update uprobe/uretprobe benchmarks to be
> >>> less x86-specific. Right now "-nop" is the happy fastest case, "-push"
> >>> is still happy, slightly slower case (due to the need to emulate stack
> >>> operation) and "-ret" is meant to be the slow single-step case. We
> >>> should adjust the naming and make sure that on ARM64 we hit similar
> >>> code paths. Given you seem to know arm64 pretty well, can you please
> >>> take a look at updating bench tool for ARM64 (we can also rename
> >>> benchmarks to something a bit more generic, rather than using
> >>> instruction names)?
> >>
> >> Let me use a matrix below for the structured comparsion of uprobe/uretprobe
> >> benchmarks on X86 and Arm64:
> >>
> >> Architecture  Instrution Type   Handling method   Performance
> >> X86           nop               Emulated          Fastest
> >> X86           push              Emulated          Fast
> >> X86           ret               Single-step       Slow
> >> Arm64         nop               Emulated          Fastest
> >> Arm64         push              Emulated          Fast
> >> Arm64         ret               Emulated          Faster
> >>
> >> I suggest categorize benchmarks into 'emu' for emulated instructions and 'ss'
> >> for 'single-steppable' instructions. Generally, emulated instructions should
> >> outperform single-step ones across different architectures. Regarding the
> >> generic naming, I propose using a self-explanatory style, such as
> >> s/nop/empty-insn/g, s/push/push-stack/g, s/ret/func-return/g.
> >>
> >> Above all, example "bench --list" output:
> >>
> >> X86:
> >>   ...
> >>   trig-uprobe-emu-empty-insn
> >>   trig-uprobe-ss-func-return
> >>   trig-uprobe-emu-push-stack
> >>   trig-uretprobe-emu-empyt-insn
> >>   trig-uretprobe-ss-func-return
> >>   trig-uretprobe-emu-push-stack
> >>   ...
> >>
> >> Arm64:
> >>   ...
> >>   trig-uprobe-emu-empty-insn
> >>   trig-uprobe-emu-func-return
> >>   trig-uprobe-emu-push-stack
> >>   trig-uretprobe-emu-empyt-insn
> >>   trig-uretprobe-emu-func-return
> >>   trig-uretprobe-emu-push-stack
> >>   ...
> >>
> >> This structure will allow for direct comparison of uprobe/uretprobe
> >> performance across different architectures and instruction types.
> >> Please let me know your thought, Andrii.
> >
> > Tbh, sounds a bit like an overkill. But before we decide on naming,
> > what kind of situation is single-stepped on arm64?
>
> On Arm64, the following instruction types are generally not single-steppable.
>
>   - Modifying and reading PC, including 'ret' and various branch instructions.
>
>   - Forming a PC-relative address using the PC and an immediate value.
>
>   - Generating exception, includes BRK, HLT, HVC, SMC, SVC.
>
>   - Loading memory at address calculated based on the PC and an immediate offset.
>
>   - Moving general-purpose register to system registers of PE (similar to logical cores on x86).
>
>   - Hint instruction cause exception or unintend behavior for single-stepping.
>     However, 'nop' is steppable hint.
>
> Most parts of instructions that doesn't fall into any of these types are single-stepped,
> including the Arm64 equvialents of 'push'.

Ok, so you special-cased the "push %rbp" equivalent, by any other
push-like instruction will be single-stepped, right?

So how about we make sure that we have three classes of
uprobe/uretprobe benchmarks:

  - fastest nop case, and we call it uprobe/uretprobe-usdt or keep it
as uprobe/uretprobe-nop. USDT is sort of a target case for this, so
I'm fine changing the name;
  - slightly less fast but common "push %rbp"-like case, which we can
call uprobe-entry (as in function entry case);
  - and slowest single-stepped, here the naming is a bit less clear,
so uprobe-slow or uprobe-ss (single-step, but if someone wants to read
"super slow" I'm fine with it as well ;). Or uprobe-sstep, I don't
know.

WDYT?

>
> Thanks.
>
> >
> >>
> >> Thanks.
> >>
> >>>
> >>>>
> >>>>>
> >>>>>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> >>>>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> >>>>>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> >>>>>>> Redis SET (RPS) uprobe: 42728.52
> >>>>>>> Redis GET (RPS) uprobe: 43640.18
> >>>>>>> Redis SET (RPS) uretprobe: 40624.54
> >>>>>>> Redis GET (RPS) uretprobe: 41180.56
> >>>>>>>
> >>>>>>> after-opt
> >>>>>>> ---------
> >>>>>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>>>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>>>>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>>>>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>>>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>>>>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>>>>>> Redis SET (RPS) uprobe: 43939.69
> >>>>>>> Redis GET (RPS) uprobe: 45200.80
> >>>>>>> Redis SET (RPS) uretprobe: 41658.58
> >>>>>>> Redis GET (RPS) uretprobe: 42805.80
> >>>>>>>
> >>>>>>> While some uprobes might still need to share the same insn_slot, this
> >>>>>>> patch compare the instructions in the resued insn_slot with the
> >>>>>>> instructions execute out-of-line firstly to decides allocate a new one
> >>>>>>> or not.
> >>>>>>>
> >>>>>>> Additionally, this patch use a rbtree associated with each thread that
> >>>>>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> >>>>>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
> >>>>>>> contention, it result in faster lookup times compared to find_uprobe().
> >>>>>>>
> >>>>>>> The other part of this patch are some necessary memory management for
> >>>>>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> >>>>>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
> >>>>>>> uprobe_breakpoints will be freed when thread exit.
> >>>>>>>
> >>>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>>>>>> ---
> >>>>>>>  include/linux/uprobes.h |   3 +
> >>>>>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> >>>>>>>  2 files changed, 211 insertions(+), 38 deletions(-)
> >>>>>>>
> >>>>>
> >>>>> [...]
> >>>>
> >>>> --
> >>>> BR
> >>>> Liao, Chang
> >>
> >> --
> >> BR
> >> Liao, Chang
>
> --
> BR
> Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 6 months ago


在 2024/8/9 2:26, Andrii Nakryiko 写道:
> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>> Hi Andrii and Oleg.
>>
>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>> within the mailing list. Considering this interest, I've added you and other relevant
>> maintainers to the CC list for broader visibility and potential collaboration.
>>
> 
> Hi Liao,
> 
> As you can see there is an active work to improve uprobes, that
> changes lifetime management of uprobes, removes a bunch of locks taken
> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> hold off a bit with your changes until all that lands. And then
> re-benchmark, as costs might shift.

Andrii, I'm trying to integrate your lockless changes into the upstream
next-20240806 kernel tree. And I ran into some conflicts. please let me
know which kernel you're currently working on.

Thanks.

> 
> But also see some remarks below.
> 
>> Thanks.
>>
>> 在 2024/7/27 17:44, Liao Chang 写道:
>>> The profiling result of single-thread model of selftests bench reveals
>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>
>>> This patch introduce struct uprobe_breakpoint to track previously
>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>> need for redundant insn_slot writes and subsequent expensive cache
>>> flush, especially on architecture like ARM64. This patch has been tested
>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>> bench and Redis GET/SET benchmark result below reveal obivious
>>> performance gain.
>>>
>>> before-opt
>>> ----------
>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> 
> I'm surprised that nop and push variants are much slower than ret
> variant. This is exactly opposite on x86-64. Do you have an
> explanation why this might be happening? I see you are trying to
> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> variant of uprobe that normally shouldn't be used. Typically uprobe is
> installed on nop (for USDT) and on function entry (which would be push
> variant, `push %rbp` instruction).
> 
> ret variant, for x86-64, causes one extra step to go back to user
> space to execute original instruction out-of-line, and then trapping
> back to kernel for running uprobe. Which is what you normally want to
> avoid.
> 
> What I'm getting at here. It seems like maybe arm arch is missing fast
> emulated implementations for nops/push or whatever equivalents for
> ARM64 that is. Please take a look at that and see why those are slow
> and whether you can make those into fast uprobe cases?

I will spend the weekend figuring out the questions you raised. Thanks for
pointing them out.

> 
>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>> Redis SET (RPS) uprobe: 42728.52
>>> Redis GET (RPS) uprobe: 43640.18
>>> Redis SET (RPS) uretprobe: 40624.54
>>> Redis GET (RPS) uretprobe: 41180.56
>>>
>>> after-opt
>>> ---------
>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>> Redis SET (RPS) uprobe: 43939.69
>>> Redis GET (RPS) uprobe: 45200.80
>>> Redis SET (RPS) uretprobe: 41658.58
>>> Redis GET (RPS) uretprobe: 42805.80
>>>
>>> While some uprobes might still need to share the same insn_slot, this
>>> patch compare the instructions in the resued insn_slot with the
>>> instructions execute out-of-line firstly to decides allocate a new one
>>> or not.
>>>
>>> Additionally, this patch use a rbtree associated with each thread that
>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>> contention, it result in faster lookup times compared to find_uprobe().
>>>
>>> The other part of this patch are some necessary memory management for
>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>> uprobe_breakpoints will be freed when thread exit.
>>>
>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>> ---
>>>  include/linux/uprobes.h |   3 +
>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>
> 
> [...]

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 6 months ago

On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> > On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >> Hi Andrii and Oleg.
> >>
> >> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >> within the mailing list. Considering this interest, I've added you and other relevant
> >> maintainers to the CC list for broader visibility and potential collaboration.
> >>
> >
> > Hi Liao,
> >
> > As you can see there is an active work to improve uprobes, that
> > changes lifetime management of uprobes, removes a bunch of locks taken
> > in the uprobe/uretprobe hot path, etc. It would be nice if you can
> > hold off a bit with your changes until all that lands. And then
> > re-benchmark, as costs might shift.
>
> Andrii, I'm trying to integrate your lockless changes into the upstream
> next-20240806 kernel tree. And I ran into some conflicts. please let me
> know which kernel you're currently working on.
>

My patches are  based on tip/perf/core. But I also just pushed all the
changes I have accumulated (including patches I haven't sent for
review just yet), plus your patches for sighand lock removed applied
on top into [0]. So you can take a look and use that as a base for
now. Keep in mind, a bunch of those patches might still change, but
this should give you the best currently achievable performance with
uprobes/uretprobes. E.g., I'm getting something like below on x86-64
(note somewhat linear scalability with number of CPU cores, with
per-CPU performance *slowly* declining):

uprobe-nop            ( 1 cpus):    3.565 ± 0.004M/s  (  3.565M/s/cpu)
uprobe-nop            ( 2 cpus):    6.742 ± 0.009M/s  (  3.371M/s/cpu)
uprobe-nop            ( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
uprobe-nop            ( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
uprobe-nop            ( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
uprobe-nop            ( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
uprobe-nop            ( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
uprobe-nop            ( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
uprobe-nop            (10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
uprobe-nop            (12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
uprobe-nop            (14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
uprobe-nop            (16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
uprobe-nop            (24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
uprobe-nop            (32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
uprobe-nop            (40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
uprobe-nop            (48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
uprobe-nop            (56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
uprobe-nop            (64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
uprobe-nop            (72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
uprobe-nop            (80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)

uretprobe-nop         ( 1 cpus):    1.987 ± 0.001M/s  (  1.987M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.766 ± 0.003M/s  (  1.883M/s/cpu)
uretprobe-nop         ( 3 cpus):    5.638 ± 0.002M/s  (  1.879M/s/cpu)
uretprobe-nop         ( 4 cpus):    7.275 ± 0.003M/s  (  1.819M/s/cpu)
uretprobe-nop         ( 5 cpus):    9.124 ± 0.004M/s  (  1.825M/s/cpu)
uretprobe-nop         ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
uretprobe-nop         ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
uretprobe-nop         ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
uretprobe-nop         (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
uretprobe-nop         (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
uretprobe-nop         (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
uretprobe-nop         (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
uretprobe-nop         (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/cpu)
uretprobe-nop         (32 cpus):   44.869 ± 0.153M/s  (  1.402M/s/cpu)
uretprobe-nop         (40 cpus):   53.397 ± 0.220M/s  (  1.335M/s/cpu)
uretprobe-nop         (48 cpus):   48.903 ± 2.277M/s  (  1.019M/s/cpu)
uretprobe-nop         (56 cpus):   42.144 ± 1.206M/s  (  0.753M/s/cpu)
uretprobe-nop         (64 cpus):   42.656 ± 1.104M/s  (  0.666M/s/cpu)
uretprobe-nop         (72 cpus):   46.299 ± 1.443M/s  (  0.643M/s/cpu)
uretprobe-nop         (80 cpus):   46.469 ± 0.808M/s  (  0.581M/s/cpu)

uprobe-ret            ( 1 cpus):    1.219 ± 0.008M/s  (  1.219M/s/cpu)
uprobe-ret            ( 2 cpus):    1.862 ± 0.008M/s  (  0.931M/s/cpu)
uprobe-ret            ( 3 cpus):    2.874 ± 0.005M/s  (  0.958M/s/cpu)
uprobe-ret            ( 4 cpus):    3.512 ± 0.002M/s  (  0.878M/s/cpu)
uprobe-ret            ( 5 cpus):    3.549 ± 0.001M/s  (  0.710M/s/cpu)
uprobe-ret            ( 6 cpus):    3.425 ± 0.003M/s  (  0.571M/s/cpu)
uprobe-ret            ( 7 cpus):    3.551 ± 0.009M/s  (  0.507M/s/cpu)
uprobe-ret            ( 8 cpus):    3.050 ± 0.002M/s  (  0.381M/s/cpu)
uprobe-ret            (10 cpus):    2.706 ± 0.002M/s  (  0.271M/s/cpu)
uprobe-ret            (12 cpus):    2.588 ± 0.003M/s  (  0.216M/s/cpu)
uprobe-ret            (14 cpus):    2.589 ± 0.003M/s  (  0.185M/s/cpu)
uprobe-ret            (16 cpus):    2.575 ± 0.001M/s  (  0.161M/s/cpu)
uprobe-ret            (24 cpus):    1.808 ± 0.011M/s  (  0.075M/s/cpu)
uprobe-ret            (32 cpus):    1.853 ± 0.001M/s  (  0.058M/s/cpu)
uprobe-ret            (40 cpus):    1.952 ± 0.002M/s  (  0.049M/s/cpu)
uprobe-ret            (48 cpus):    2.075 ± 0.007M/s  (  0.043M/s/cpu)
uprobe-ret            (56 cpus):    2.441 ± 0.004M/s  (  0.044M/s/cpu)
uprobe-ret            (64 cpus):    1.880 ± 0.012M/s  (  0.029M/s/cpu)
uprobe-ret            (72 cpus):    0.962 ± 0.002M/s  (  0.013M/s/cpu)
uprobe-ret            (80 cpus):    1.040 ± 0.011M/s  (  0.013M/s/cpu)

uretprobe-ret         ( 1 cpus):    0.981 ± 0.000M/s  (  0.981M/s/cpu)
uretprobe-ret         ( 2 cpus):    1.421 ± 0.001M/s  (  0.711M/s/cpu)
uretprobe-ret         ( 3 cpus):    2.050 ± 0.003M/s  (  0.683M/s/cpu)
uretprobe-ret         ( 4 cpus):    2.596 ± 0.002M/s  (  0.649M/s/cpu)
uretprobe-ret         ( 5 cpus):    3.105 ± 0.003M/s  (  0.621M/s/cpu)
uretprobe-ret         ( 6 cpus):    3.886 ± 0.002M/s  (  0.648M/s/cpu)
uretprobe-ret         ( 7 cpus):    3.016 ± 0.001M/s  (  0.431M/s/cpu)
uretprobe-ret         ( 8 cpus):    2.903 ± 0.000M/s  (  0.363M/s/cpu)
uretprobe-ret         (10 cpus):    2.755 ± 0.001M/s  (  0.276M/s/cpu)
uretprobe-ret         (12 cpus):    2.400 ± 0.001M/s  (  0.200M/s/cpu)
uretprobe-ret         (14 cpus):    3.972 ± 0.001M/s  (  0.284M/s/cpu)
uretprobe-ret         (16 cpus):    3.940 ± 0.003M/s  (  0.246M/s/cpu)
uretprobe-ret         (24 cpus):    3.002 ± 0.003M/s  (  0.125M/s/cpu)
uretprobe-ret         (32 cpus):    3.018 ± 0.003M/s  (  0.094M/s/cpu)
uretprobe-ret         (40 cpus):    1.846 ± 0.000M/s  (  0.046M/s/cpu)
uretprobe-ret         (48 cpus):    2.487 ± 0.004M/s  (  0.052M/s/cpu)
uretprobe-ret         (56 cpus):    2.470 ± 0.006M/s  (  0.044M/s/cpu)
uretprobe-ret         (64 cpus):    2.027 ± 0.014M/s  (  0.032M/s/cpu)
uretprobe-ret         (72 cpus):    1.108 ± 0.011M/s  (  0.015M/s/cpu)
uretprobe-ret         (80 cpus):    0.982 ± 0.005M/s  (  0.012M/s/cpu)


-ret variants (single-stepping case for x86-64) still suck, but they
suck 2x less now with your patches :) Clearly more work ahead for
those, though.


  [0] https://github.com/anakryiko/linux/commits/uprobes-lockless-cumulative/


> Thanks.
>
> >
> > But also see some remarks below.
> >
> >> Thanks.
> >>
> >> 在 2024/7/27 17:44, Liao Chang 写道:
> >>> The profiling result of single-thread model of selftests bench reveals
> >>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> >>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> >>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> >>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> >>>
> >>> This patch introduce struct uprobe_breakpoint to track previously
> >>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> >>> need for redundant insn_slot writes and subsequent expensive cache
> >>> flush, especially on architecture like ARM64. This patch has been tested
> >>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> >>> bench and Redis GET/SET benchmark result below reveal obivious
> >>> performance gain.
> >>>
> >>> before-opt
> >>> ----------
> >>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> >>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> >>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> >
> > I'm surprised that nop and push variants are much slower than ret
> > variant. This is exactly opposite on x86-64. Do you have an
> > explanation why this might be happening? I see you are trying to
> > optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> > variant of uprobe that normally shouldn't be used. Typically uprobe is
> > installed on nop (for USDT) and on function entry (which would be push
> > variant, `push %rbp` instruction).
> >
> > ret variant, for x86-64, causes one extra step to go back to user
> > space to execute original instruction out-of-line, and then trapping
> > back to kernel for running uprobe. Which is what you normally want to
> > avoid.
> >
> > What I'm getting at here. It seems like maybe arm arch is missing fast
> > emulated implementations for nops/push or whatever equivalents for
> > ARM64 that is. Please take a look at that and see why those are slow
> > and whether you can make those into fast uprobe cases?
>
> I will spend the weekend figuring out the questions you raised. Thanks for
> pointing them out.
>
> >
> >>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> >>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> >>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> >>> Redis SET (RPS) uprobe: 42728.52
> >>> Redis GET (RPS) uprobe: 43640.18
> >>> Redis SET (RPS) uretprobe: 40624.54
> >>> Redis GET (RPS) uretprobe: 41180.56
> >>>
> >>> after-opt
> >>> ---------
> >>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>> Redis SET (RPS) uprobe: 43939.69
> >>> Redis GET (RPS) uprobe: 45200.80
> >>> Redis SET (RPS) uretprobe: 41658.58
> >>> Redis GET (RPS) uretprobe: 42805.80
> >>>
> >>> While some uprobes might still need to share the same insn_slot, this
> >>> patch compare the instructions in the resued insn_slot with the
> >>> instructions execute out-of-line firstly to decides allocate a new one
> >>> or not.
> >>>
> >>> Additionally, this patch use a rbtree associated with each thread that
> >>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> >>> rbtree of uprobe_breakpoints has smaller node, better locality and less
> >>> contention, it result in faster lookup times compared to find_uprobe().
> >>>
> >>> The other part of this patch are some necessary memory management for
> >>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> >>> hit uprobe that doesn't already have a corresponding node in rbtree. All
> >>> uprobe_breakpoints will be freed when thread exit.
> >>>
> >>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>> ---
> >>>  include/linux/uprobes.h |   3 +
> >>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> >>>  2 files changed, 211 insertions(+), 38 deletions(-)
> >>>
> >
> > [...]
>
> --
> BR
> Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 6 months ago

On Fri, Aug 9, 2024 at 11:34 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >
> >
> >
> > 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> > > On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> > >>
> > >> Hi Andrii and Oleg.
> > >>
> > >> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> > >> on arm64. I notice recent discussions on the performance and scalability of uprobes
> > >> within the mailing list. Considering this interest, I've added you and other relevant
> > >> maintainers to the CC list for broader visibility and potential collaboration.
> > >>
> > >
> > > Hi Liao,
> > >
> > > As you can see there is an active work to improve uprobes, that
> > > changes lifetime management of uprobes, removes a bunch of locks taken
> > > in the uprobe/uretprobe hot path, etc. It would be nice if you can
> > > hold off a bit with your changes until all that lands. And then
> > > re-benchmark, as costs might shift.
> >
> > Andrii, I'm trying to integrate your lockless changes into the upstream
> > next-20240806 kernel tree. And I ran into some conflicts. please let me
> > know which kernel you're currently working on.
> >
>
> My patches are  based on tip/perf/core. But I also just pushed all the
> changes I have accumulated (including patches I haven't sent for
> review just yet), plus your patches for sighand lock removed applied
> on top into [0]. So you can take a look and use that as a base for
> now. Keep in mind, a bunch of those patches might still change, but
> this should give you the best currently achievable performance with
> uprobes/uretprobes. E.g., I'm getting something like below on x86-64
> (note somewhat linear scalability with number of CPU cores, with
> per-CPU performance *slowly* declining):
>
> uprobe-nop            ( 1 cpus):    3.565 ± 0.004M/s  (  3.565M/s/cpu)
> uprobe-nop            ( 2 cpus):    6.742 ± 0.009M/s  (  3.371M/s/cpu)
> uprobe-nop            ( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
> uprobe-nop            ( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
> uprobe-nop            ( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
> uprobe-nop            ( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
> uprobe-nop            ( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
> uprobe-nop            ( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
> uprobe-nop            (10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
> uprobe-nop            (12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
> uprobe-nop            (14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
> uprobe-nop            (16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
> uprobe-nop            (24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
> uprobe-nop            (32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
> uprobe-nop            (40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
> uprobe-nop            (48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
> uprobe-nop            (56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
> uprobe-nop            (64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
> uprobe-nop            (72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
> uprobe-nop            (80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)
>
> uretprobe-nop         ( 1 cpus):    1.987 ± 0.001M/s  (  1.987M/s/cpu)
> uretprobe-nop         ( 2 cpus):    3.766 ± 0.003M/s  (  1.883M/s/cpu)
> uretprobe-nop         ( 3 cpus):    5.638 ± 0.002M/s  (  1.879M/s/cpu)
> uretprobe-nop         ( 4 cpus):    7.275 ± 0.003M/s  (  1.819M/s/cpu)
> uretprobe-nop         ( 5 cpus):    9.124 ± 0.004M/s  (  1.825M/s/cpu)
> uretprobe-nop         ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
> uretprobe-nop         ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
> uretprobe-nop         ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
> uretprobe-nop         (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
> uretprobe-nop         (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
> uretprobe-nop         (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
> uretprobe-nop         (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
> uretprobe-nop         (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/cpu)
> uretprobe-nop         (32 cpus):   44.869 ± 0.153M/s  (  1.402M/s/cpu)
> uretprobe-nop         (40 cpus):   53.397 ± 0.220M/s  (  1.335M/s/cpu)
> uretprobe-nop         (48 cpus):   48.903 ± 2.277M/s  (  1.019M/s/cpu)
> uretprobe-nop         (56 cpus):   42.144 ± 1.206M/s  (  0.753M/s/cpu)
> uretprobe-nop         (64 cpus):   42.656 ± 1.104M/s  (  0.666M/s/cpu)
> uretprobe-nop         (72 cpus):   46.299 ± 1.443M/s  (  0.643M/s/cpu)
> uretprobe-nop         (80 cpus):   46.469 ± 0.808M/s  (  0.581M/s/cpu)
>
> uprobe-ret            ( 1 cpus):    1.219 ± 0.008M/s  (  1.219M/s/cpu)
> uprobe-ret            ( 2 cpus):    1.862 ± 0.008M/s  (  0.931M/s/cpu)
> uprobe-ret            ( 3 cpus):    2.874 ± 0.005M/s  (  0.958M/s/cpu)
> uprobe-ret            ( 4 cpus):    3.512 ± 0.002M/s  (  0.878M/s/cpu)
> uprobe-ret            ( 5 cpus):    3.549 ± 0.001M/s  (  0.710M/s/cpu)
> uprobe-ret            ( 6 cpus):    3.425 ± 0.003M/s  (  0.571M/s/cpu)
> uprobe-ret            ( 7 cpus):    3.551 ± 0.009M/s  (  0.507M/s/cpu)
> uprobe-ret            ( 8 cpus):    3.050 ± 0.002M/s  (  0.381M/s/cpu)
> uprobe-ret            (10 cpus):    2.706 ± 0.002M/s  (  0.271M/s/cpu)
> uprobe-ret            (12 cpus):    2.588 ± 0.003M/s  (  0.216M/s/cpu)
> uprobe-ret            (14 cpus):    2.589 ± 0.003M/s  (  0.185M/s/cpu)
> uprobe-ret            (16 cpus):    2.575 ± 0.001M/s  (  0.161M/s/cpu)
> uprobe-ret            (24 cpus):    1.808 ± 0.011M/s  (  0.075M/s/cpu)
> uprobe-ret            (32 cpus):    1.853 ± 0.001M/s  (  0.058M/s/cpu)
> uprobe-ret            (40 cpus):    1.952 ± 0.002M/s  (  0.049M/s/cpu)
> uprobe-ret            (48 cpus):    2.075 ± 0.007M/s  (  0.043M/s/cpu)
> uprobe-ret            (56 cpus):    2.441 ± 0.004M/s  (  0.044M/s/cpu)
> uprobe-ret            (64 cpus):    1.880 ± 0.012M/s  (  0.029M/s/cpu)
> uprobe-ret            (72 cpus):    0.962 ± 0.002M/s  (  0.013M/s/cpu)
> uprobe-ret            (80 cpus):    1.040 ± 0.011M/s  (  0.013M/s/cpu)
>
> uretprobe-ret         ( 1 cpus):    0.981 ± 0.000M/s  (  0.981M/s/cpu)
> uretprobe-ret         ( 2 cpus):    1.421 ± 0.001M/s  (  0.711M/s/cpu)
> uretprobe-ret         ( 3 cpus):    2.050 ± 0.003M/s  (  0.683M/s/cpu)
> uretprobe-ret         ( 4 cpus):    2.596 ± 0.002M/s  (  0.649M/s/cpu)
> uretprobe-ret         ( 5 cpus):    3.105 ± 0.003M/s  (  0.621M/s/cpu)
> uretprobe-ret         ( 6 cpus):    3.886 ± 0.002M/s  (  0.648M/s/cpu)
> uretprobe-ret         ( 7 cpus):    3.016 ± 0.001M/s  (  0.431M/s/cpu)
> uretprobe-ret         ( 8 cpus):    2.903 ± 0.000M/s  (  0.363M/s/cpu)
> uretprobe-ret         (10 cpus):    2.755 ± 0.001M/s  (  0.276M/s/cpu)
> uretprobe-ret         (12 cpus):    2.400 ± 0.001M/s  (  0.200M/s/cpu)
> uretprobe-ret         (14 cpus):    3.972 ± 0.001M/s  (  0.284M/s/cpu)
> uretprobe-ret         (16 cpus):    3.940 ± 0.003M/s  (  0.246M/s/cpu)
> uretprobe-ret         (24 cpus):    3.002 ± 0.003M/s  (  0.125M/s/cpu)
> uretprobe-ret         (32 cpus):    3.018 ± 0.003M/s  (  0.094M/s/cpu)
> uretprobe-ret         (40 cpus):    1.846 ± 0.000M/s  (  0.046M/s/cpu)
> uretprobe-ret         (48 cpus):    2.487 ± 0.004M/s  (  0.052M/s/cpu)
> uretprobe-ret         (56 cpus):    2.470 ± 0.006M/s  (  0.044M/s/cpu)
> uretprobe-ret         (64 cpus):    2.027 ± 0.014M/s  (  0.032M/s/cpu)
> uretprobe-ret         (72 cpus):    1.108 ± 0.011M/s  (  0.015M/s/cpu)
> uretprobe-ret         (80 cpus):    0.982 ± 0.005M/s  (  0.012M/s/cpu)
>
>
> -ret variants (single-stepping case for x86-64) still suck, but they
> suck 2x less now with your patches :) Clearly more work ahead for
> those, though.
>

Quick profiling shows that it's mostly xol_take_insn_slot() and
xol_free_insn_slot(), now. So it seems like your planned work might
help here.

>
>   [0] https://github.com/anakryiko/linux/commits/uprobes-lockless-cumulative/
>
>
> > Thanks.
> >
> > >
> > > But also see some remarks below.
> > >
> > >> Thanks.
> > >>
> > >> 在 2024/7/27 17:44, Liao Chang 写道:
> > >>> The profiling result of single-thread model of selftests bench reveals
> > >>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
> > >>> ARM64. On my local testing machine, 5% of CPU time is consumed by
> > >>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
> > >>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
> > >>>
> > >>> This patch introduce struct uprobe_breakpoint to track previously
> > >>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
> > >>> need for redundant insn_slot writes and subsequent expensive cache
> > >>> flush, especially on architecture like ARM64. This patch has been tested
> > >>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
> > >>> bench and Redis GET/SET benchmark result below reveal obivious
> > >>> performance gain.
> > >>>
> > >>> before-opt
> > >>> ----------
> > >>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
> > >>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
> > >>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
> > >
> > > I'm surprised that nop and push variants are much slower than ret
> > > variant. This is exactly opposite on x86-64. Do you have an
> > > explanation why this might be happening? I see you are trying to
> > > optimize xol_get_insn_slot(), but that is (at least for x86) a slow
> > > variant of uprobe that normally shouldn't be used. Typically uprobe is
> > > installed on nop (for USDT) and on function entry (which would be push
> > > variant, `push %rbp` instruction).
> > >
> > > ret variant, for x86-64, causes one extra step to go back to user
> > > space to execute original instruction out-of-line, and then trapping
> > > back to kernel for running uprobe. Which is what you normally want to
> > > avoid.
> > >
> > > What I'm getting at here. It seems like maybe arm arch is missing fast
> > > emulated implementations for nops/push or whatever equivalents for
> > > ARM64 that is. Please take a look at that and see why those are slow
> > > and whether you can make those into fast uprobe cases?
> >
> > I will spend the weekend figuring out the questions you raised. Thanks for
> > pointing them out.
> >
> > >
> > >>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
> > >>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
> > >>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
> > >>> Redis SET (RPS) uprobe: 42728.52
> > >>> Redis GET (RPS) uprobe: 43640.18
> > >>> Redis SET (RPS) uretprobe: 40624.54
> > >>> Redis GET (RPS) uretprobe: 41180.56
> > >>>
> > >>> after-opt
> > >>> ---------
> > >>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> > >>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> > >>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> > >>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> > >>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> > >>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> > >>> Redis SET (RPS) uprobe: 43939.69
> > >>> Redis GET (RPS) uprobe: 45200.80
> > >>> Redis SET (RPS) uretprobe: 41658.58
> > >>> Redis GET (RPS) uretprobe: 42805.80
> > >>>
> > >>> While some uprobes might still need to share the same insn_slot, this
> > >>> patch compare the instructions in the resued insn_slot with the
> > >>> instructions execute out-of-line firstly to decides allocate a new one
> > >>> or not.
> > >>>
> > >>> Additionally, this patch use a rbtree associated with each thread that
> > >>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
> > >>> rbtree of uprobe_breakpoints has smaller node, better locality and less
> > >>> contention, it result in faster lookup times compared to find_uprobe().
> > >>>
> > >>> The other part of this patch are some necessary memory management for
> > >>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
> > >>> hit uprobe that doesn't already have a corresponding node in rbtree. All
> > >>> uprobe_breakpoints will be freed when thread exit.
> > >>>
> > >>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> > >>> ---
> > >>>  include/linux/uprobes.h |   3 +
> > >>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
> > >>>  2 files changed, 211 insertions(+), 38 deletions(-)
> > >>>
> > >
> > > [...]
> >
> > --
> > BR
> > Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 5 months ago


在 2024/8/10 2:40, Andrii Nakryiko 写道:
> On Fri, Aug 9, 2024 at 11:34 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
>>
>> On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>
>>>
>>>
>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>>
>>>>> Hi Andrii and Oleg.
>>>>>
>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>>>>> within the mailing list. Considering this interest, I've added you and other relevant
>>>>> maintainers to the CC list for broader visibility and potential collaboration.
>>>>>
>>>>
>>>> Hi Liao,
>>>>
>>>> As you can see there is an active work to improve uprobes, that
>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>> hold off a bit with your changes until all that lands. And then
>>>> re-benchmark, as costs might shift.
>>>
>>> Andrii, I'm trying to integrate your lockless changes into the upstream
>>> next-20240806 kernel tree. And I ran into some conflicts. please let me
>>> know which kernel you're currently working on.
>>>
>>
>> My patches are  based on tip/perf/core. But I also just pushed all the
>> changes I have accumulated (including patches I haven't sent for
>> review just yet), plus your patches for sighand lock removed applied
>> on top into [0]. So you can take a look and use that as a base for
>> now. Keep in mind, a bunch of those patches might still change, but
>> this should give you the best currently achievable performance with
>> uprobes/uretprobes. E.g., I'm getting something like below on x86-64
>> (note somewhat linear scalability with number of CPU cores, with
>> per-CPU performance *slowly* declining):
>>
>> uprobe-nop            ( 1 cpus):    3.565 ± 0.004M/s  (  3.565M/s/cpu)
>> uprobe-nop            ( 2 cpus):    6.742 ± 0.009M/s  (  3.371M/s/cpu)
>> uprobe-nop            ( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
>> uprobe-nop            ( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
>> uprobe-nop            ( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
>> uprobe-nop            ( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
>> uprobe-nop            ( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
>> uprobe-nop            ( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
>> uprobe-nop            (10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
>> uprobe-nop            (12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
>> uprobe-nop            (14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
>> uprobe-nop            (16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
>> uprobe-nop            (24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
>> uprobe-nop            (32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
>> uprobe-nop            (40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
>> uprobe-nop            (48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
>> uprobe-nop            (56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
>> uprobe-nop            (64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
>> uprobe-nop            (72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
>> uprobe-nop            (80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)
>>
>> uretprobe-nop         ( 1 cpus):    1.987 ± 0.001M/s  (  1.987M/s/cpu)
>> uretprobe-nop         ( 2 cpus):    3.766 ± 0.003M/s  (  1.883M/s/cpu)
>> uretprobe-nop         ( 3 cpus):    5.638 ± 0.002M/s  (  1.879M/s/cpu)
>> uretprobe-nop         ( 4 cpus):    7.275 ± 0.003M/s  (  1.819M/s/cpu)
>> uretprobe-nop         ( 5 cpus):    9.124 ± 0.004M/s  (  1.825M/s/cpu)
>> uretprobe-nop         ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
>> uretprobe-nop         ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
>> uretprobe-nop         ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
>> uretprobe-nop         (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
>> uretprobe-nop         (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
>> uretprobe-nop         (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
>> uretprobe-nop         (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
>> uretprobe-nop         (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/cpu)
>> uretprobe-nop         (32 cpus):   44.869 ± 0.153M/s  (  1.402M/s/cpu)
>> uretprobe-nop         (40 cpus):   53.397 ± 0.220M/s  (  1.335M/s/cpu)
>> uretprobe-nop         (48 cpus):   48.903 ± 2.277M/s  (  1.019M/s/cpu)
>> uretprobe-nop         (56 cpus):   42.144 ± 1.206M/s  (  0.753M/s/cpu)
>> uretprobe-nop         (64 cpus):   42.656 ± 1.104M/s  (  0.666M/s/cpu)
>> uretprobe-nop         (72 cpus):   46.299 ± 1.443M/s  (  0.643M/s/cpu)
>> uretprobe-nop         (80 cpus):   46.469 ± 0.808M/s  (  0.581M/s/cpu)
>>
>> uprobe-ret            ( 1 cpus):    1.219 ± 0.008M/s  (  1.219M/s/cpu)
>> uprobe-ret            ( 2 cpus):    1.862 ± 0.008M/s  (  0.931M/s/cpu)
>> uprobe-ret            ( 3 cpus):    2.874 ± 0.005M/s  (  0.958M/s/cpu)
>> uprobe-ret            ( 4 cpus):    3.512 ± 0.002M/s  (  0.878M/s/cpu)
>> uprobe-ret            ( 5 cpus):    3.549 ± 0.001M/s  (  0.710M/s/cpu)
>> uprobe-ret            ( 6 cpus):    3.425 ± 0.003M/s  (  0.571M/s/cpu)
>> uprobe-ret            ( 7 cpus):    3.551 ± 0.009M/s  (  0.507M/s/cpu)
>> uprobe-ret            ( 8 cpus):    3.050 ± 0.002M/s  (  0.381M/s/cpu)
>> uprobe-ret            (10 cpus):    2.706 ± 0.002M/s  (  0.271M/s/cpu)
>> uprobe-ret            (12 cpus):    2.588 ± 0.003M/s  (  0.216M/s/cpu)
>> uprobe-ret            (14 cpus):    2.589 ± 0.003M/s  (  0.185M/s/cpu)
>> uprobe-ret            (16 cpus):    2.575 ± 0.001M/s  (  0.161M/s/cpu)
>> uprobe-ret            (24 cpus):    1.808 ± 0.011M/s  (  0.075M/s/cpu)
>> uprobe-ret            (32 cpus):    1.853 ± 0.001M/s  (  0.058M/s/cpu)
>> uprobe-ret            (40 cpus):    1.952 ± 0.002M/s  (  0.049M/s/cpu)
>> uprobe-ret            (48 cpus):    2.075 ± 0.007M/s  (  0.043M/s/cpu)
>> uprobe-ret            (56 cpus):    2.441 ± 0.004M/s  (  0.044M/s/cpu)
>> uprobe-ret            (64 cpus):    1.880 ± 0.012M/s  (  0.029M/s/cpu)
>> uprobe-ret            (72 cpus):    0.962 ± 0.002M/s  (  0.013M/s/cpu)
>> uprobe-ret            (80 cpus):    1.040 ± 0.011M/s  (  0.013M/s/cpu)
>>
>> uretprobe-ret         ( 1 cpus):    0.981 ± 0.000M/s  (  0.981M/s/cpu)
>> uretprobe-ret         ( 2 cpus):    1.421 ± 0.001M/s  (  0.711M/s/cpu)
>> uretprobe-ret         ( 3 cpus):    2.050 ± 0.003M/s  (  0.683M/s/cpu)
>> uretprobe-ret         ( 4 cpus):    2.596 ± 0.002M/s  (  0.649M/s/cpu)
>> uretprobe-ret         ( 5 cpus):    3.105 ± 0.003M/s  (  0.621M/s/cpu)
>> uretprobe-ret         ( 6 cpus):    3.886 ± 0.002M/s  (  0.648M/s/cpu)
>> uretprobe-ret         ( 7 cpus):    3.016 ± 0.001M/s  (  0.431M/s/cpu)
>> uretprobe-ret         ( 8 cpus):    2.903 ± 0.000M/s  (  0.363M/s/cpu)
>> uretprobe-ret         (10 cpus):    2.755 ± 0.001M/s  (  0.276M/s/cpu)
>> uretprobe-ret         (12 cpus):    2.400 ± 0.001M/s  (  0.200M/s/cpu)
>> uretprobe-ret         (14 cpus):    3.972 ± 0.001M/s  (  0.284M/s/cpu)
>> uretprobe-ret         (16 cpus):    3.940 ± 0.003M/s  (  0.246M/s/cpu)
>> uretprobe-ret         (24 cpus):    3.002 ± 0.003M/s  (  0.125M/s/cpu)
>> uretprobe-ret         (32 cpus):    3.018 ± 0.003M/s  (  0.094M/s/cpu)
>> uretprobe-ret         (40 cpus):    1.846 ± 0.000M/s  (  0.046M/s/cpu)
>> uretprobe-ret         (48 cpus):    2.487 ± 0.004M/s  (  0.052M/s/cpu)
>> uretprobe-ret         (56 cpus):    2.470 ± 0.006M/s  (  0.044M/s/cpu)
>> uretprobe-ret         (64 cpus):    2.027 ± 0.014M/s  (  0.032M/s/cpu)
>> uretprobe-ret         (72 cpus):    1.108 ± 0.011M/s  (  0.015M/s/cpu)
>> uretprobe-ret         (80 cpus):    0.982 ± 0.005M/s  (  0.012M/s/cpu)
>>
>>
>> -ret variants (single-stepping case for x86-64) still suck, but they
>> suck 2x less now with your patches :) Clearly more work ahead for
>> those, though.
>>
> 
> Quick profiling shows that it's mostly xol_take_insn_slot() and
> xol_free_insn_slot(), now. So it seems like your planned work might
> help here.

Andrii, I'm glad we've reached a similar result, The profiling result on
my machine reveals that about 80% cycles spent on the atomic operations
on area->bitmap and area->slot_count. I guess the atomic access leads to
the intensive cacheline bouncing bewteen CPUs.

In the passed weekend, I have been working on another patch that optimizes
the xol_take_insn_slot() and xol_free_inns_slot() for better scalability.
This involves delaying the freeing of xol insn slots to reduce the times
of atomic operations and cacheline bouncing. Additionally, per-task refcounts
and an RCU-style management of linked-list of allocated insn slots. In short
summary, this patch try to replace coarse-grained atomic variables with
finer-grained ones, aiming to elimiate the expensive atomic instructions
in the hot path. If you or others have bandwidth and interest, I'd welcome
a brainstorming session on this topic.

Thanks.

> 
>>
>>   [0] https://github.com/anakryiko/linux/commits/uprobes-lockless-cumulative/
>>
>>
>>> Thanks.
>>>
>>>>
>>>> But also see some remarks below.
>>>>
>>>>> Thanks.
>>>>>
>>>>> 在 2024/7/27 17:44, Liao Chang 写道:
>>>>>> The profiling result of single-thread model of selftests bench reveals
>>>>>> performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
>>>>>> ARM64. On my local testing machine, 5% of CPU time is consumed by
>>>>>> find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
>>>>>> about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.
>>>>>>
>>>>>> This patch introduce struct uprobe_breakpoint to track previously
>>>>>> allocated insn_slot for frequently hit uprobe. it effectively reduce the
>>>>>> need for redundant insn_slot writes and subsequent expensive cache
>>>>>> flush, especially on architecture like ARM64. This patch has been tested
>>>>>> on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
>>>>>> bench and Redis GET/SET benchmark result below reveal obivious
>>>>>> performance gain.
>>>>>>
>>>>>> before-opt
>>>>>> ----------
>>>>>> trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
>>>>>> trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
>>>>>> trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)
>>>>
>>>> I'm surprised that nop and push variants are much slower than ret
>>>> variant. This is exactly opposite on x86-64. Do you have an
>>>> explanation why this might be happening? I see you are trying to
>>>> optimize xol_get_insn_slot(), but that is (at least for x86) a slow
>>>> variant of uprobe that normally shouldn't be used. Typically uprobe is
>>>> installed on nop (for USDT) and on function entry (which would be push
>>>> variant, `push %rbp` instruction).
>>>>
>>>> ret variant, for x86-64, causes one extra step to go back to user
>>>> space to execute original instruction out-of-line, and then trapping
>>>> back to kernel for running uprobe. Which is what you normally want to
>>>> avoid.
>>>>
>>>> What I'm getting at here. It seems like maybe arm arch is missing fast
>>>> emulated implementations for nops/push or whatever equivalents for
>>>> ARM64 that is. Please take a look at that and see why those are slow
>>>> and whether you can make those into fast uprobe cases?
>>>
>>> I will spend the weekend figuring out the questions you raised. Thanks for
>>> pointing them out.
>>>
>>>>
>>>>>> trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
>>>>>> trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
>>>>>> trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
>>>>>> Redis SET (RPS) uprobe: 42728.52
>>>>>> Redis GET (RPS) uprobe: 43640.18
>>>>>> Redis SET (RPS) uretprobe: 40624.54
>>>>>> Redis GET (RPS) uretprobe: 41180.56
>>>>>>
>>>>>> after-opt
>>>>>> ---------
>>>>>> trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>>> trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>>> trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>>> trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>>> trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>>> trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>> Redis SET (RPS) uprobe: 43939.69
>>>>>> Redis GET (RPS) uprobe: 45200.80
>>>>>> Redis SET (RPS) uretprobe: 41658.58
>>>>>> Redis GET (RPS) uretprobe: 42805.80
>>>>>>
>>>>>> While some uprobes might still need to share the same insn_slot, this
>>>>>> patch compare the instructions in the resued insn_slot with the
>>>>>> instructions execute out-of-line firstly to decides allocate a new one
>>>>>> or not.
>>>>>>
>>>>>> Additionally, this patch use a rbtree associated with each thread that
>>>>>> hit uprobes to manage these allocated uprobe_breakpoint data. Due to the
>>>>>> rbtree of uprobe_breakpoints has smaller node, better locality and less
>>>>>> contention, it result in faster lookup times compared to find_uprobe().
>>>>>>
>>>>>> The other part of this patch are some necessary memory management for
>>>>>> uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
>>>>>> hit uprobe that doesn't already have a corresponding node in rbtree. All
>>>>>> uprobe_breakpoints will be freed when thread exit.
>>>>>>
>>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>>> ---
>>>>>>  include/linux/uprobes.h |   3 +
>>>>>>  kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
>>>>>>  2 files changed, 211 insertions(+), 38 deletions(-)
>>>>>>
>>>>
>>>> [...]
>>>
>>> --
>>> BR
>>> Liao, Chang

-- 
BR
Liao, Chang

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Andrii Nakryiko 1 year, 5 months ago

On Mon, Aug 12, 2024 at 5:05 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/10 2:40, Andrii Nakryiko 写道:
> > On Fri, Aug 9, 2024 at 11:34 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> >>
> >> On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>
> >>>
> >>>
> >>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
> >>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>>>>
> >>>>> Hi Andrii and Oleg.
> >>>>>
> >>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
> >>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
> >>>>> within the mailing list. Considering this interest, I've added you and other relevant
> >>>>> maintainers to the CC list for broader visibility and potential collaboration.
> >>>>>
> >>>>
> >>>> Hi Liao,
> >>>>
> >>>> As you can see there is an active work to improve uprobes, that
> >>>> changes lifetime management of uprobes, removes a bunch of locks taken
> >>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
> >>>> hold off a bit with your changes until all that lands. And then
> >>>> re-benchmark, as costs might shift.
> >>>
> >>> Andrii, I'm trying to integrate your lockless changes into the upstream
> >>> next-20240806 kernel tree. And I ran into some conflicts. please let me
> >>> know which kernel you're currently working on.
> >>>
> >>
> >> My patches are  based on tip/perf/core. But I also just pushed all the
> >> changes I have accumulated (including patches I haven't sent for
> >> review just yet), plus your patches for sighand lock removed applied
> >> on top into [0]. So you can take a look and use that as a base for
> >> now. Keep in mind, a bunch of those patches might still change, but
> >> this should give you the best currently achievable performance with
> >> uprobes/uretprobes. E.g., I'm getting something like below on x86-64
> >> (note somewhat linear scalability with number of CPU cores, with
> >> per-CPU performance *slowly* declining):
> >>
> >> uprobe-nop            ( 1 cpus):    3.565 ± 0.004M/s  (  3.565M/s/cpu)
> >> uprobe-nop            ( 2 cpus):    6.742 ± 0.009M/s  (  3.371M/s/cpu)
> >> uprobe-nop            ( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
> >> uprobe-nop            ( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
> >> uprobe-nop            ( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
> >> uprobe-nop            ( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
> >> uprobe-nop            ( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
> >> uprobe-nop            ( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
> >> uprobe-nop            (10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
> >> uprobe-nop            (12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
> >> uprobe-nop            (14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
> >> uprobe-nop            (16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
> >> uprobe-nop            (24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
> >> uprobe-nop            (32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
> >> uprobe-nop            (40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
> >> uprobe-nop            (48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
> >> uprobe-nop            (56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
> >> uprobe-nop            (64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
> >> uprobe-nop            (72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
> >> uprobe-nop            (80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)
> >>
> >> uretprobe-nop         ( 1 cpus):    1.987 ± 0.001M/s  (  1.987M/s/cpu)
> >> uretprobe-nop         ( 2 cpus):    3.766 ± 0.003M/s  (  1.883M/s/cpu)
> >> uretprobe-nop         ( 3 cpus):    5.638 ± 0.002M/s  (  1.879M/s/cpu)
> >> uretprobe-nop         ( 4 cpus):    7.275 ± 0.003M/s  (  1.819M/s/cpu)
> >> uretprobe-nop         ( 5 cpus):    9.124 ± 0.004M/s  (  1.825M/s/cpu)
> >> uretprobe-nop         ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
> >> uretprobe-nop         ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
> >> uretprobe-nop         ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
> >> uretprobe-nop         (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
> >> uretprobe-nop         (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
> >> uretprobe-nop         (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
> >> uretprobe-nop         (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
> >> uretprobe-nop         (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/cpu)
> >> uretprobe-nop         (32 cpus):   44.869 ± 0.153M/s  (  1.402M/s/cpu)
> >> uretprobe-nop         (40 cpus):   53.397 ± 0.220M/s  (  1.335M/s/cpu)
> >> uretprobe-nop         (48 cpus):   48.903 ± 2.277M/s  (  1.019M/s/cpu)
> >> uretprobe-nop         (56 cpus):   42.144 ± 1.206M/s  (  0.753M/s/cpu)
> >> uretprobe-nop         (64 cpus):   42.656 ± 1.104M/s  (  0.666M/s/cpu)
> >> uretprobe-nop         (72 cpus):   46.299 ± 1.443M/s  (  0.643M/s/cpu)
> >> uretprobe-nop         (80 cpus):   46.469 ± 0.808M/s  (  0.581M/s/cpu)
> >>
> >> uprobe-ret            ( 1 cpus):    1.219 ± 0.008M/s  (  1.219M/s/cpu)
> >> uprobe-ret            ( 2 cpus):    1.862 ± 0.008M/s  (  0.931M/s/cpu)
> >> uprobe-ret            ( 3 cpus):    2.874 ± 0.005M/s  (  0.958M/s/cpu)
> >> uprobe-ret            ( 4 cpus):    3.512 ± 0.002M/s  (  0.878M/s/cpu)
> >> uprobe-ret            ( 5 cpus):    3.549 ± 0.001M/s  (  0.710M/s/cpu)
> >> uprobe-ret            ( 6 cpus):    3.425 ± 0.003M/s  (  0.571M/s/cpu)
> >> uprobe-ret            ( 7 cpus):    3.551 ± 0.009M/s  (  0.507M/s/cpu)
> >> uprobe-ret            ( 8 cpus):    3.050 ± 0.002M/s  (  0.381M/s/cpu)
> >> uprobe-ret            (10 cpus):    2.706 ± 0.002M/s  (  0.271M/s/cpu)
> >> uprobe-ret            (12 cpus):    2.588 ± 0.003M/s  (  0.216M/s/cpu)
> >> uprobe-ret            (14 cpus):    2.589 ± 0.003M/s  (  0.185M/s/cpu)
> >> uprobe-ret            (16 cpus):    2.575 ± 0.001M/s  (  0.161M/s/cpu)
> >> uprobe-ret            (24 cpus):    1.808 ± 0.011M/s  (  0.075M/s/cpu)
> >> uprobe-ret            (32 cpus):    1.853 ± 0.001M/s  (  0.058M/s/cpu)
> >> uprobe-ret            (40 cpus):    1.952 ± 0.002M/s  (  0.049M/s/cpu)
> >> uprobe-ret            (48 cpus):    2.075 ± 0.007M/s  (  0.043M/s/cpu)
> >> uprobe-ret            (56 cpus):    2.441 ± 0.004M/s  (  0.044M/s/cpu)
> >> uprobe-ret            (64 cpus):    1.880 ± 0.012M/s  (  0.029M/s/cpu)
> >> uprobe-ret            (72 cpus):    0.962 ± 0.002M/s  (  0.013M/s/cpu)
> >> uprobe-ret            (80 cpus):    1.040 ± 0.011M/s  (  0.013M/s/cpu)
> >>
> >> uretprobe-ret         ( 1 cpus):    0.981 ± 0.000M/s  (  0.981M/s/cpu)
> >> uretprobe-ret         ( 2 cpus):    1.421 ± 0.001M/s  (  0.711M/s/cpu)
> >> uretprobe-ret         ( 3 cpus):    2.050 ± 0.003M/s  (  0.683M/s/cpu)
> >> uretprobe-ret         ( 4 cpus):    2.596 ± 0.002M/s  (  0.649M/s/cpu)
> >> uretprobe-ret         ( 5 cpus):    3.105 ± 0.003M/s  (  0.621M/s/cpu)
> >> uretprobe-ret         ( 6 cpus):    3.886 ± 0.002M/s  (  0.648M/s/cpu)
> >> uretprobe-ret         ( 7 cpus):    3.016 ± 0.001M/s  (  0.431M/s/cpu)
> >> uretprobe-ret         ( 8 cpus):    2.903 ± 0.000M/s  (  0.363M/s/cpu)
> >> uretprobe-ret         (10 cpus):    2.755 ± 0.001M/s  (  0.276M/s/cpu)
> >> uretprobe-ret         (12 cpus):    2.400 ± 0.001M/s  (  0.200M/s/cpu)
> >> uretprobe-ret         (14 cpus):    3.972 ± 0.001M/s  (  0.284M/s/cpu)
> >> uretprobe-ret         (16 cpus):    3.940 ± 0.003M/s  (  0.246M/s/cpu)
> >> uretprobe-ret         (24 cpus):    3.002 ± 0.003M/s  (  0.125M/s/cpu)
> >> uretprobe-ret         (32 cpus):    3.018 ± 0.003M/s  (  0.094M/s/cpu)
> >> uretprobe-ret         (40 cpus):    1.846 ± 0.000M/s  (  0.046M/s/cpu)
> >> uretprobe-ret         (48 cpus):    2.487 ± 0.004M/s  (  0.052M/s/cpu)
> >> uretprobe-ret         (56 cpus):    2.470 ± 0.006M/s  (  0.044M/s/cpu)
> >> uretprobe-ret         (64 cpus):    2.027 ± 0.014M/s  (  0.032M/s/cpu)
> >> uretprobe-ret         (72 cpus):    1.108 ± 0.011M/s  (  0.015M/s/cpu)
> >> uretprobe-ret         (80 cpus):    0.982 ± 0.005M/s  (  0.012M/s/cpu)
> >>
> >>
> >> -ret variants (single-stepping case for x86-64) still suck, but they
> >> suck 2x less now with your patches :) Clearly more work ahead for
> >> those, though.
> >>
> >
> > Quick profiling shows that it's mostly xol_take_insn_slot() and
> > xol_free_insn_slot(), now. So it seems like your planned work might
> > help here.
>
> Andrii, I'm glad we've reached a similar result, The profiling result on
> my machine reveals that about 80% cycles spent on the atomic operations
> on area->bitmap and area->slot_count. I guess the atomic access leads to
> the intensive cacheline bouncing bewteen CPUs.
>
> In the passed weekend, I have been working on another patch that optimizes
> the xol_take_insn_slot() and xol_free_inns_slot() for better scalability.
> This involves delaying the freeing of xol insn slots to reduce the times
> of atomic operations and cacheline bouncing. Additionally, per-task refcounts
> and an RCU-style management of linked-list of allocated insn slots. In short
> summary, this patch try to replace coarse-grained atomic variables with
> finer-grained ones, aiming to elimiate the expensive atomic instructions
> in the hot path. If you or others have bandwidth and interest, I'd welcome
> a brainstorming session on this topic.

I'm happy to help, but I still feel like it's best to concentrate on
landing all the other pending things for uprobe, and then switch to
optimizing the xol case.

We have:
  - RCU protection and avoiding refcounting for uprobes (I'll be
sending latest revision soon);
  - SRCU+timeout for uretprobe and single-step (pending the above
landing first);
  - removing shared nhit counter increment in trace_uprobe (I've sent
patches last week, see [0]);
  - lockless VMA -> inode -> uprobe look up (also pending for #1 to
land, and some more benchmarking for mm_lock_seq changes from Suren,
see [1]);
  - and, of course, your work to remove sighand lock.

So as you can see, there is plenty to discuss and land already, I just
don't want to spread the efforts too thin. But if you can help improve
the benchmark for ARM64, that would be a great parallel effort setting
us up for further work nicely. Thanks!

  [0] https://lore.kernel.org/bpf/20240809192357.4061484-1-andrii@kernel.org/
  [1] https://lore.kernel.org/linux-mm/CAEf4BzaocU-CQsFZ=s5gDM6XQ0Foss_HroFsPUesBn=qgJCprg@mail.gmail.com/

>
> Thanks.
>
> >
> >>
> >>   [0] https://github.com/anakryiko/linux/commits/uprobes-lockless-cumulative/
> >>
> >>
> >>> Thanks.
> >>>
> >>>>
> >>>> But also see some remarks below.
> >>>>
> >>>>> Thanks.
> >>>>>

[...]

Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

Posted by Liao, Chang 1 year, 5 months ago


在 2024/8/13 1:57, Andrii Nakryiko 写道:
> On Mon, Aug 12, 2024 at 5:05 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/8/10 2:40, Andrii Nakryiko 写道:
>>> On Fri, Aug 9, 2024 at 11:34 AM Andrii Nakryiko
>>> <andrii.nakryiko@gmail.com> wrote:
>>>>
>>>> On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> 在 2024/8/9 2:26, Andrii Nakryiko 写道:
>>>>>> On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>>>>
>>>>>>> Hi Andrii and Oleg.
>>>>>>>
>>>>>>> This patch sent by me two weeks ago also aim to optimize the performance of uprobe
>>>>>>> on arm64. I notice recent discussions on the performance and scalability of uprobes
>>>>>>> within the mailing list. Considering this interest, I've added you and other relevant
>>>>>>> maintainers to the CC list for broader visibility and potential collaboration.
>>>>>>>
>>>>>>
>>>>>> Hi Liao,
>>>>>>
>>>>>> As you can see there is an active work to improve uprobes, that
>>>>>> changes lifetime management of uprobes, removes a bunch of locks taken
>>>>>> in the uprobe/uretprobe hot path, etc. It would be nice if you can
>>>>>> hold off a bit with your changes until all that lands. And then
>>>>>> re-benchmark, as costs might shift.
>>>>>
>>>>> Andrii, I'm trying to integrate your lockless changes into the upstream
>>>>> next-20240806 kernel tree. And I ran into some conflicts. please let me
>>>>> know which kernel you're currently working on.
>>>>>
>>>>
>>>> My patches are  based on tip/perf/core. But I also just pushed all the
>>>> changes I have accumulated (including patches I haven't sent for
>>>> review just yet), plus your patches for sighand lock removed applied
>>>> on top into [0]. So you can take a look and use that as a base for
>>>> now. Keep in mind, a bunch of those patches might still change, but
>>>> this should give you the best currently achievable performance with
>>>> uprobes/uretprobes. E.g., I'm getting something like below on x86-64
>>>> (note somewhat linear scalability with number of CPU cores, with
>>>> per-CPU performance *slowly* declining):
>>>>
>>>> uprobe-nop            ( 1 cpus):    3.565 ± 0.004M/s  (  3.565M/s/cpu)
>>>> uprobe-nop            ( 2 cpus):    6.742 ± 0.009M/s  (  3.371M/s/cpu)
>>>> uprobe-nop            ( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
>>>> uprobe-nop            ( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
>>>> uprobe-nop            ( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
>>>> uprobe-nop            ( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
>>>> uprobe-nop            ( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
>>>> uprobe-nop            ( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
>>>> uprobe-nop            (10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
>>>> uprobe-nop            (12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
>>>> uprobe-nop            (14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
>>>> uprobe-nop            (16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
>>>> uprobe-nop            (24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
>>>> uprobe-nop            (32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
>>>> uprobe-nop            (40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
>>>> uprobe-nop            (48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
>>>> uprobe-nop            (56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
>>>> uprobe-nop            (64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
>>>> uprobe-nop            (72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
>>>> uprobe-nop            (80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)
>>>>
>>>> uretprobe-nop         ( 1 cpus):    1.987 ± 0.001M/s  (  1.987M/s/cpu)
>>>> uretprobe-nop         ( 2 cpus):    3.766 ± 0.003M/s  (  1.883M/s/cpu)
>>>> uretprobe-nop         ( 3 cpus):    5.638 ± 0.002M/s  (  1.879M/s/cpu)
>>>> uretprobe-nop         ( 4 cpus):    7.275 ± 0.003M/s  (  1.819M/s/cpu)
>>>> uretprobe-nop         ( 5 cpus):    9.124 ± 0.004M/s  (  1.825M/s/cpu)
>>>> uretprobe-nop         ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
>>>> uretprobe-nop         ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
>>>> uretprobe-nop         ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
>>>> uretprobe-nop         (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
>>>> uretprobe-nop         (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
>>>> uretprobe-nop         (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
>>>> uretprobe-nop         (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
>>>> uretprobe-nop         (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/cpu)
>>>> uretprobe-nop         (32 cpus):   44.869 ± 0.153M/s  (  1.402M/s/cpu)
>>>> uretprobe-nop         (40 cpus):   53.397 ± 0.220M/s  (  1.335M/s/cpu)
>>>> uretprobe-nop         (48 cpus):   48.903 ± 2.277M/s  (  1.019M/s/cpu)
>>>> uretprobe-nop         (56 cpus):   42.144 ± 1.206M/s  (  0.753M/s/cpu)
>>>> uretprobe-nop         (64 cpus):   42.656 ± 1.104M/s  (  0.666M/s/cpu)
>>>> uretprobe-nop         (72 cpus):   46.299 ± 1.443M/s  (  0.643M/s/cpu)
>>>> uretprobe-nop         (80 cpus):   46.469 ± 0.808M/s  (  0.581M/s/cpu)
>>>>
>>>> uprobe-ret            ( 1 cpus):    1.219 ± 0.008M/s  (  1.219M/s/cpu)
>>>> uprobe-ret            ( 2 cpus):    1.862 ± 0.008M/s  (  0.931M/s/cpu)
>>>> uprobe-ret            ( 3 cpus):    2.874 ± 0.005M/s  (  0.958M/s/cpu)
>>>> uprobe-ret            ( 4 cpus):    3.512 ± 0.002M/s  (  0.878M/s/cpu)
>>>> uprobe-ret            ( 5 cpus):    3.549 ± 0.001M/s  (  0.710M/s/cpu)
>>>> uprobe-ret            ( 6 cpus):    3.425 ± 0.003M/s  (  0.571M/s/cpu)
>>>> uprobe-ret            ( 7 cpus):    3.551 ± 0.009M/s  (  0.507M/s/cpu)
>>>> uprobe-ret            ( 8 cpus):    3.050 ± 0.002M/s  (  0.381M/s/cpu)
>>>> uprobe-ret            (10 cpus):    2.706 ± 0.002M/s  (  0.271M/s/cpu)
>>>> uprobe-ret            (12 cpus):    2.588 ± 0.003M/s  (  0.216M/s/cpu)
>>>> uprobe-ret            (14 cpus):    2.589 ± 0.003M/s  (  0.185M/s/cpu)
>>>> uprobe-ret            (16 cpus):    2.575 ± 0.001M/s  (  0.161M/s/cpu)
>>>> uprobe-ret            (24 cpus):    1.808 ± 0.011M/s  (  0.075M/s/cpu)
>>>> uprobe-ret            (32 cpus):    1.853 ± 0.001M/s  (  0.058M/s/cpu)
>>>> uprobe-ret            (40 cpus):    1.952 ± 0.002M/s  (  0.049M/s/cpu)
>>>> uprobe-ret            (48 cpus):    2.075 ± 0.007M/s  (  0.043M/s/cpu)
>>>> uprobe-ret            (56 cpus):    2.441 ± 0.004M/s  (  0.044M/s/cpu)
>>>> uprobe-ret            (64 cpus):    1.880 ± 0.012M/s  (  0.029M/s/cpu)
>>>> uprobe-ret            (72 cpus):    0.962 ± 0.002M/s  (  0.013M/s/cpu)
>>>> uprobe-ret            (80 cpus):    1.040 ± 0.011M/s  (  0.013M/s/cpu)
>>>>
>>>> uretprobe-ret         ( 1 cpus):    0.981 ± 0.000M/s  (  0.981M/s/cpu)
>>>> uretprobe-ret         ( 2 cpus):    1.421 ± 0.001M/s  (  0.711M/s/cpu)
>>>> uretprobe-ret         ( 3 cpus):    2.050 ± 0.003M/s  (  0.683M/s/cpu)
>>>> uretprobe-ret         ( 4 cpus):    2.596 ± 0.002M/s  (  0.649M/s/cpu)
>>>> uretprobe-ret         ( 5 cpus):    3.105 ± 0.003M/s  (  0.621M/s/cpu)
>>>> uretprobe-ret         ( 6 cpus):    3.886 ± 0.002M/s  (  0.648M/s/cpu)
>>>> uretprobe-ret         ( 7 cpus):    3.016 ± 0.001M/s  (  0.431M/s/cpu)
>>>> uretprobe-ret         ( 8 cpus):    2.903 ± 0.000M/s  (  0.363M/s/cpu)
>>>> uretprobe-ret         (10 cpus):    2.755 ± 0.001M/s  (  0.276M/s/cpu)
>>>> uretprobe-ret         (12 cpus):    2.400 ± 0.001M/s  (  0.200M/s/cpu)
>>>> uretprobe-ret         (14 cpus):    3.972 ± 0.001M/s  (  0.284M/s/cpu)
>>>> uretprobe-ret         (16 cpus):    3.940 ± 0.003M/s  (  0.246M/s/cpu)
>>>> uretprobe-ret         (24 cpus):    3.002 ± 0.003M/s  (  0.125M/s/cpu)
>>>> uretprobe-ret         (32 cpus):    3.018 ± 0.003M/s  (  0.094M/s/cpu)
>>>> uretprobe-ret         (40 cpus):    1.846 ± 0.000M/s  (  0.046M/s/cpu)
>>>> uretprobe-ret         (48 cpus):    2.487 ± 0.004M/s  (  0.052M/s/cpu)
>>>> uretprobe-ret         (56 cpus):    2.470 ± 0.006M/s  (  0.044M/s/cpu)
>>>> uretprobe-ret         (64 cpus):    2.027 ± 0.014M/s  (  0.032M/s/cpu)
>>>> uretprobe-ret         (72 cpus):    1.108 ± 0.011M/s  (  0.015M/s/cpu)
>>>> uretprobe-ret         (80 cpus):    0.982 ± 0.005M/s  (  0.012M/s/cpu)
>>>>
>>>>
>>>> -ret variants (single-stepping case for x86-64) still suck, but they
>>>> suck 2x less now with your patches :) Clearly more work ahead for
>>>> those, though.
>>>>
>>>
>>> Quick profiling shows that it's mostly xol_take_insn_slot() and
>>> xol_free_insn_slot(), now. So it seems like your planned work might
>>> help here.
>>
>> Andrii, I'm glad we've reached a similar result, The profiling result on
>> my machine reveals that about 80% cycles spent on the atomic operations
>> on area->bitmap and area->slot_count. I guess the atomic access leads to
>> the intensive cacheline bouncing bewteen CPUs.
>>
>> In the passed weekend, I have been working on another patch that optimizes
>> the xol_take_insn_slot() and xol_free_inns_slot() for better scalability.
>> This involves delaying the freeing of xol insn slots to reduce the times
>> of atomic operations and cacheline bouncing. Additionally, per-task refcounts
>> and an RCU-style management of linked-list of allocated insn slots. In short
>> summary, this patch try to replace coarse-grained atomic variables with
>> finer-grained ones, aiming to elimiate the expensive atomic instructions
>> in the hot path. If you or others have bandwidth and interest, I'd welcome
>> a brainstorming session on this topic.
> 
> I'm happy to help, but I still feel like it's best to concentrate on
> landing all the other pending things for uprobe, and then switch to
> optimizing the xol case.
> 
> We have:
>   - RCU protection and avoiding refcounting for uprobes (I'll be
> sending latest revision soon);
>   - SRCU+timeout for uretprobe and single-step (pending the above
> landing first);
>   - removing shared nhit counter increment in trace_uprobe (I've sent
> patches last week, see [0]);
>   - lockless VMA -> inode -> uprobe look up (also pending for #1 to
> land, and some more benchmarking for mm_lock_seq changes from Suren,
> see [1]);
>   - and, of course, your work to remove sighand lock.
> 
> So as you can see, there is plenty to discuss and land already, I just
> don't want to spread the efforts too thin. But if you can help improve
> the benchmark for ARM64, that would be a great parallel effort setting
> us up for further work nicely. Thanks!

Agree. Let's prioritize landing your exising patches. I'll build upon
you works for the further uprobe optimizatoin for Arm64.

Thanks.

> 
>   [0] https://lore.kernel.org/bpf/20240809192357.4061484-1-andrii@kernel.org/
>   [1] https://lore.kernel.org/linux-mm/CAEf4BzaocU-CQsFZ=s5gDM6XQ0Foss_HroFsPUesBn=qgJCprg@mail.gmail.com/
> 
>>
>> Thanks.
>>
>>>
>>>>
>>>>   [0] https://github.com/anakryiko/linux/commits/uprobes-lockless-cumulative/
>>>>
>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> But also see some remarks below.
>>>>>>
>>>>>>> Thanks.
>>>>>>>
> 
> [...]

-- 
BR
Liao, Chang