From: Liao Chang
Subject: [PATCH] uprobes: Optimize the allocation of insn_slot for performance
Date: Sat, 27 Jul 2024 09:44:05 +0000
Message-ID: <20240727094405.1362496-1-liaochang1@huawei.com>

The profiling result of the single-thread model of the selftests bench
reveals performance bottlenecks in find_uprobe() and
caches_clean_inval_pou() on ARM64. On my local testing machine, 5% of
CPU time is consumed by find_uprobe() for trig-uprobe-ret, while
caches_clean_inval_pou() takes about 34% of CPU time for trig-uprobe-nop
and trig-uprobe-push.

This patch introduces struct uprobe_breakpoint to track the insn_slot
previously allocated for a frequently hit uprobe. It effectively
reduces the need for redundant insn_slot writes and the subsequent
expensive cache flushes, especially on architectures like ARM64.
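To illustrate the write-avoidance half of the idea, here is a minimal
user-space sketch (illustrative only, no code below is part of this
patch; slot[], SLOT_BYTES and write_slot_if_changed() are hypothetical
stand-ins): the copy into the slot, and the cache maintenance that
makes it expensive on ARM64, is skipped whenever a cheap memcmp()
shows the slot already holds the same instructions.

#include <stdio.h>
#include <string.h>

#define SLOT_BYTES 16	/* stand-in for UPROBE_XOL_SLOT_BYTES */

/* One reusable "xol slot". */
static unsigned char slot[SLOT_BYTES];

/*
 * Write insn into the slot only when its current contents differ.
 * In the patch, a memcmp() like this one gates arch_uprobe_copy_ixol(),
 * which is what ends up in caches_clean_inval_pou() on ARM64.
 */
static int write_slot_if_changed(const unsigned char *insn, size_t len)
{
	if (!memcmp(slot, insn, len))
		return 0;	/* same instructions: skip copy and flush */

	memcpy(slot, insn, len);
	/* ...the expensive cache flush would happen here... */
	return 1;
}

int main(void)
{
	/* aarch64 "nop" encoding, zero-padded to the slot size. */
	unsigned char insn[SLOT_BYTES] = { 0x1f, 0x20, 0x03, 0xd5 };

	printf("first hit copies: %d\n", write_slot_if_changed(insn, sizeof(insn)));
	printf("repeat hit skips: %d\n", write_slot_if_changed(insn, sizeof(insn)));
	return 0;
}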
This patch has been tested on Kunpeng916 (Hi1616), 4 NUMA nodes, 64
cores @ 2.4GHz. The selftests bench and the Redis GET/SET benchmark
results below reveal an obvious performance gain.

before-opt
----------
trig-uprobe-nop:      0.371 ± 0.001M/s  (0.371M/prod)
trig-uprobe-push:     0.370 ± 0.001M/s  (0.370M/prod)
trig-uprobe-ret:      1.637 ± 0.001M/s  (1.647M/prod)
trig-uretprobe-nop:   0.331 ± 0.004M/s  (0.331M/prod)
trig-uretprobe-push:  0.333 ± 0.000M/s  (0.333M/prod)
trig-uretprobe-ret:   0.854 ± 0.002M/s  (0.854M/prod)
Redis SET (RPS) uprobe:    42728.52
Redis GET (RPS) uprobe:    43640.18
Redis SET (RPS) uretprobe: 40624.54
Redis GET (RPS) uretprobe: 41180.56

after-opt
---------
trig-uprobe-nop:      0.916 ± 0.001M/s  (0.916M/prod)
trig-uprobe-push:     0.908 ± 0.001M/s  (0.908M/prod)
trig-uprobe-ret:      1.855 ± 0.000M/s  (1.855M/prod)
trig-uretprobe-nop:   0.640 ± 0.000M/s  (0.640M/prod)
trig-uretprobe-push:  0.633 ± 0.001M/s  (0.633M/prod)
trig-uretprobe-ret:   0.978 ± 0.003M/s  (0.978M/prod)
Redis SET (RPS) uprobe:    43939.69
Redis GET (RPS) uprobe:    45200.80
Redis SET (RPS) uretprobe: 41658.58
Redis GET (RPS) uretprobe: 42805.80

Since some uprobes might still end up sharing the same insn_slot, this
patch first compares the instructions already in the reused insn_slot
with the instructions to be executed out-of-line, and only writes the
slot again when they differ.

Additionally, this patch uses an rbtree associated with each thread
that hits uprobes to manage the allocated uprobe_breakpoint data.
Because this rbtree has fewer nodes, better locality and less
contention, lookups in it are faster than find_uprobe() (see the
sketch after the "---" marker below).

The rest of this patch is the necessary memory management for the
uprobe_breakpoint data: a uprobe_breakpoint is allocated for each newly
hit uprobe that doesn't already have a corresponding node in the
rbtree, and all uprobe_breakpoints are freed when the thread exits.

Signed-off-by: Liao Chang
---
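As a rough user-space analogue of the per-thread lookup cache (again
illustrative only, nothing below is from the patch: struct entry,
lookup() and the tsearch()-based trees stand in for
utask->breakpoints_tree sitting in front of the global uprobes_tree),
repeated hits on the same address stay on a small thread-local tree
and never touch the shared lock:

#include <pthread.h>
#include <search.h>
#include <stdio.h>

struct entry {
	unsigned long vaddr;
	int payload;
};

static int cmp_entry(const void *a, const void *b)
{
	const struct entry *x = a, *y = b;

	return (x->vaddr > y->vaddr) - (x->vaddr < y->vaddr);
}

/* Shared tree behind a lock: stand-in for uprobes_tree/find_uprobe(). */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static void *global_tree;

/* Small per-thread tree: stand-in for utask->breakpoints_tree. */
static __thread void *local_tree;

static struct entry *lookup(unsigned long vaddr)
{
	struct entry key = { .vaddr = vaddr };
	struct entry *e;
	void *found;

	/* Fast path: thread-local, no shared lock taken. */
	found = tfind(&key, &local_tree, cmp_entry);
	if (found)
		return *(struct entry **)found;

	/* Slow path: contended global lookup, once per address. */
	pthread_mutex_lock(&global_lock);
	found = tfind(&key, &global_tree, cmp_entry);
	pthread_mutex_unlock(&global_lock);
	if (!found)
		return NULL;

	/* Cache the result so later hits stay on the fast path. */
	e = *(struct entry **)found;
	tsearch(e, &local_tree, cmp_entry);
	return e;
}

int main(void)
{
	static struct entry e = { .vaddr = 0x1000, .payload = 42 };

	tsearch(&e, &global_tree, cmp_entry);
	printf("slow path: %d\n", lookup(0x1000)->payload);
	printf("fast path: %d\n", lookup(0x1000)->payload);
	return 0;
}

In the patch itself the per-thread tree additionally carries the cached
slot index and a refcounted uprobe pointer, and lookups take a read
lock on breakpoints_treelock because delete_breakpoint() can erase
nodes when the thread exits.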
 include/linux/uprobes.h |   3 +
 kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
 2 files changed, 211 insertions(+), 38 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..04ee465980af 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -78,6 +78,9 @@ struct uprobe_task {
 
 	struct return_instance		*return_instances;
 	unsigned int			depth;
+
+	struct rb_root			breakpoints_tree;
+	rwlock_t			breakpoints_treelock;
 };
 
 struct return_instance {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..3f1a6dc2a327 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -33,6 +33,7 @@
 #define MAX_UPROBE_XOL_SLOTS		UINSNS_PER_PAGE
 
 static struct rb_root uprobes_tree = RB_ROOT;
+
 /*
  * allows us to skip the uprobe_mmap if there are no uprobe events active
  * at this time.  Probably a fine grained per inode count is better?
@@ -886,6 +887,174 @@ static bool filter_chain(struct uprobe *uprobe,
 	return ret;
 }
 
+static struct uprobe_task *get_utask(void);
+static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr);
+static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp);
+
+struct uprobe_breakpoint {
+	struct rb_node rb_node;
+	refcount_t ref;
+	unsigned long slot;
+	unsigned long vaddr;
+	struct uprobe *uprobe;
+};
+
+static void put_ubp(struct uprobe_breakpoint *ubp)
+{
+	if (refcount_dec_and_test(&ubp->ref)) {
+		put_uprobe(ubp->uprobe);
+		kfree(ubp);
+	}
+}
+
+static struct uprobe_breakpoint *get_ubp(struct uprobe_breakpoint *ubp)
+{
+	refcount_inc(&ubp->ref);
+	return ubp;
+}
+
+#define __node_2_ubp(node) rb_entry((node), struct uprobe_breakpoint, rb_node)
+
+struct __ubp_key {
+	unsigned long bp_vaddr;
+};
+
+static int ubp_cmp(const unsigned long bp_vaddr,
+		   const struct uprobe_breakpoint *ubp)
+{
+	if (bp_vaddr < ubp->vaddr)
+		return -1;
+
+	if (bp_vaddr > ubp->vaddr)
+		return 1;
+
+	return 0;
+}
+
+static int __ubp_cmp_key(const void *k, const struct rb_node *b)
+{
+	const struct __ubp_key *key = k;
+
+	return ubp_cmp(key->bp_vaddr, __node_2_ubp(b));
+}
+
+static int __ubp_cmp(struct rb_node *a, const struct rb_node *b)
+{
+	const struct uprobe_breakpoint *ubp = __node_2_ubp(a);
+
+	return ubp_cmp(ubp->vaddr, __node_2_ubp(b));
+}
+
+static void delete_breakpoint(struct uprobe_task *utask)
+{
+	struct rb_node *node;
+
+	write_lock(&utask->breakpoints_treelock);
+	node = rb_first(&utask->breakpoints_tree);
+	while (node) {
+		rb_erase(node, &utask->breakpoints_tree);
+		write_unlock(&utask->breakpoints_treelock);
+
+		put_ubp(__node_2_ubp(node));
+
+		write_lock(&utask->breakpoints_treelock);
+		node = rb_first(&utask->breakpoints_tree);
+	}
+	write_unlock(&utask->breakpoints_treelock);
+}
+
+static struct uprobe_breakpoint *alloc_breakpoint(struct uprobe_task *utask,
+						  struct uprobe *uprobe,
+						  unsigned long bp_vaddr)
+{
+	struct uprobe_breakpoint *ubp;
+	struct rb_node *node;
+
+	ubp = kzalloc(sizeof(struct uprobe_breakpoint), GFP_KERNEL);
+	if (!ubp)
+		return NULL;
+
+	ubp->vaddr = bp_vaddr;
+	ubp->uprobe = uprobe;
+	/* get access + creation ref */
+	refcount_set(&ubp->ref, 2);
+	ubp->slot = UINSNS_PER_PAGE;
+
+	write_lock(&utask->breakpoints_treelock);
+	node = rb_find_add(&ubp->rb_node, &utask->breakpoints_tree, __ubp_cmp);
+	write_unlock(&utask->breakpoints_treelock);
+
+	/* Two or more threads hit the same breakpoint */
+	if (node) {
+		WARN_ON(uprobe != __node_2_ubp(node)->uprobe);
+		kfree(ubp);
+		return get_ubp(__node_2_ubp(node));
+	}
+
+	return ubp;
+}
+
+static struct uprobe_breakpoint *find_breakpoint(struct uprobe_task *utask,
+						 unsigned long bp_vaddr)
+{
+	struct rb_node *node;
+	struct __ubp_key key = {
+		.bp_vaddr = bp_vaddr,
+	};
+
+	read_lock(&utask->breakpoints_treelock);
+	node = rb_find(&key, &utask->breakpoints_tree, __ubp_cmp_key);
+	read_unlock(&utask->breakpoints_treelock);
+
+	if (node)
+		return get_ubp(__node_2_ubp(node));
+
+	return NULL;
+}
+
+static struct uprobe_breakpoint *find_active_breakpoint(struct pt_regs *regs,
+							unsigned long bp_vaddr)
+{
+	struct uprobe_task *utask = get_utask();
+	struct uprobe_breakpoint *ubp;
+	struct uprobe *uprobe;
+	int is_swbp;
+
+	if (unlikely(!utask))
+		return NULL;
+
+	ubp = find_breakpoint(utask, bp_vaddr);
+	if (ubp)
+		return ubp;
+
+	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
+	if (!uprobe) {
+		if (is_swbp > 0) {
+			/* No matching uprobe; signal SIGTRAP. */
+			force_sig(SIGTRAP);
+		} else {
+			/*
+			 * Either we raced with uprobe_unregister() or we can't
+			 * access this memory. The latter is only possible if
+			 * another thread plays with our ->mm. In both cases
+			 * we can simply restart. If this vma was unmapped we
+			 * can pretend this insn was not executed yet and get
+			 * the (correct) SIGSEGV after restart.
+			 */
+			instruction_pointer_set(regs, bp_vaddr);
+		}
+		return NULL;
+	}
+
+	ubp = alloc_breakpoint(utask, uprobe, bp_vaddr);
+	if (!ubp) {
+		put_uprobe(uprobe);
+		return NULL;
+	}
+
+	return ubp;
+}
+
 static int
 install_breakpoint(struct uprobe *uprobe, struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long vaddr)
@@ -1576,9 +1745,8 @@ void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
 /*
  * - search for a free slot.
  */
-static unsigned long xol_take_insn_slot(struct xol_area *area)
+static __always_inline int xol_take_insn_slot(struct xol_area *area)
 {
-	unsigned long slot_addr;
 	int slot_nr;
 
 	do {
@@ -1590,34 +1758,48 @@ static unsigned long xol_take_insn_slot(struct xol_area *area)
 			slot_nr = UINSNS_PER_PAGE;
 			continue;
 		}
-		wait_event(area->wq, (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
+		wait_event(area->wq,
+			   (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
 	} while (slot_nr >= UINSNS_PER_PAGE);
 
-	slot_addr = area->vaddr + (slot_nr * UPROBE_XOL_SLOT_BYTES);
-	atomic_inc(&area->slot_count);
+	return slot_nr;
+}
+
+static __always_inline unsigned long
+choose_insn_slot(struct xol_area *area, struct uprobe_breakpoint *ubp)
+{
+	if ((ubp->slot == UINSNS_PER_PAGE) ||
+	    test_and_set_bit(ubp->slot, area->bitmap)) {
+		ubp->slot = xol_take_insn_slot(area);
+	}
 
-	return slot_addr;
+	atomic_inc(&area->slot_count);
+	return area->vaddr + ubp->slot * UPROBE_XOL_SLOT_BYTES;
 }
 
 /*
  * xol_get_insn_slot - allocate a slot for xol.
  * Returns the allocated slot address or 0.
  */
-static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
+static unsigned long xol_get_insn_slot(struct uprobe_breakpoint *ubp)
 {
 	struct xol_area *area;
 	unsigned long xol_vaddr;
+	struct uprobe *uprobe = ubp->uprobe;
 
 	area = get_xol_area();
 	if (!area)
 		return 0;
 
-	xol_vaddr = xol_take_insn_slot(area);
+	xol_vaddr = choose_insn_slot(area, ubp);
 	if (unlikely(!xol_vaddr))
 		return 0;
 
-	arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
-			&uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
+	if (memcmp((void *)xol_vaddr, &uprobe->arch.ixol,
+		   sizeof(uprobe->arch.ixol)))
+		arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
+				      &uprobe->arch.ixol,
+				      sizeof(uprobe->arch.ixol));
 
 	return xol_vaddr;
 }
@@ -1717,8 +1899,7 @@ void uprobe_free_utask(struct task_struct *t)
 	if (!utask)
 		return;
 
-	if (utask->active_uprobe)
-		put_uprobe(utask->active_uprobe);
+	delete_breakpoint(utask);
 
 	ri = utask->return_instances;
 	while (ri)
@@ -1739,8 +1920,11 @@
  */
 static struct uprobe_task *get_utask(void)
 {
-	if (!current->utask)
+	if (!current->utask) {
 		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+		current->utask->breakpoints_tree = RB_ROOT;
+		rwlock_init(&current->utask->breakpoints_treelock);
+	}
 	return current->utask;
 }
 
@@ -1921,7 +2105,8 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 
 /* Prepare to single-step probed instruction out of line. */
 static int
-pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
+pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+	  struct uprobe_breakpoint *ubp)
 {
 	struct uprobe_task *utask;
 	unsigned long xol_vaddr;
@@ -1931,12 +2116,12 @@ pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 	if (!utask)
 		return -ENOMEM;
 
-	xol_vaddr = xol_get_insn_slot(uprobe);
+	xol_vaddr = xol_get_insn_slot(ubp);
 	if (!xol_vaddr)
 		return -ENOMEM;
 
 	utask->xol_vaddr = xol_vaddr;
-	utask->vaddr = bp_vaddr;
+	utask->vaddr = ubp->vaddr;
 
 	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
 	if (unlikely(err)) {
@@ -2182,32 +2367,19 @@ bool __weak arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
  */
 static void handle_swbp(struct pt_regs *regs)
 {
+	struct uprobe_breakpoint *ubp;
 	struct uprobe *uprobe;
 	unsigned long bp_vaddr;
-	int is_swbp;
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 	if (bp_vaddr == get_trampoline_vaddr())
 		return handle_trampoline(regs);
 
-	uprobe = find_active_uprobe(bp_vaddr, &is_swbp);
-	if (!uprobe) {
-		if (is_swbp > 0) {
-			/* No matching uprobe; signal SIGTRAP. */
-			force_sig(SIGTRAP);
-		} else {
-			/*
-			 * Either we raced with uprobe_unregister() or we can't
-			 * access this memory. The latter is only possible if
-			 * another thread plays with our ->mm. In both cases
-			 * we can simply restart. If this vma was unmapped we
-			 * can pretend this insn was not executed yet and get
-			 * the (correct) SIGSEGV after restart.
-			 */
-			instruction_pointer_set(regs, bp_vaddr);
-		}
+	ubp = find_active_breakpoint(regs, bp_vaddr);
+	if (!ubp)
 		return;
-	}
+
+	uprobe = ubp->uprobe;
 
 	/* change it in advance for ->handler() and restart */
 	instruction_pointer_set(regs, bp_vaddr);
@@ -2241,12 +2413,11 @@ static void handle_swbp(struct pt_regs *regs)
 	if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
 		goto out;
 
-	if (!pre_ssout(uprobe, regs, bp_vaddr))
-		return;
+	pre_ssout(uprobe, regs, ubp);
 
 	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
 out:
-	put_uprobe(uprobe);
+	put_ubp(ubp);
 }
 
 /*
@@ -2266,7 +2437,6 @@ static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	else
 		WARN_ON_ONCE(1);
 
-	put_uprobe(uprobe);
 	utask->active_uprobe = NULL;
 	utask->state = UTASK_RUNNING;
 	xol_free_insn_slot(current);
-- 
2.34.1