From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, peterz@infradead.org, mingo@kernel.org
Cc: oleg@redhat.com, rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org, linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, Andrii Nakryiko
Subject: [PATCH RESEND tip/perf/core] uprobes: switch to RCU Tasks Trace flavor for better performance
Date: Mon, 30 Sep 2024 14:22:46 -0700
Message-ID: <20240930212246.1829395-1-andrii@kernel.org>

This patch switches uprobes from SRCU to the RCU Tasks Trace flavor,
which is optimized for lightweight, fast readers (at the expense of
slower writers, which is a fine trade-off for uprobes) and which
performs and scales better with the number of CPUs.

As with the earlier baseline vs SRCU comparison, we've benchmarked the
SRCU-based implementation against the RCU Tasks Trace implementation.

SRCU
====
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

RCU Tasks Trace
===============
uprobe-nop      ( 1 cpus):    3.396 ± 0.002M/s  (  3.396M/s/cpu)
uprobe-nop      ( 2 cpus):    4.271 ± 0.006M/s  (  2.135M/s/cpu)
uprobe-nop      ( 4 cpus):    8.499 ± 0.015M/s  (  2.125M/s/cpu)
uprobe-nop      ( 8 cpus):   10.355 ± 0.028M/s  (  1.294M/s/cpu)
uprobe-nop      (16 cpus):    7.615 ± 0.099M/s  (  0.476M/s/cpu)
uprobe-nop      (32 cpus):    4.430 ± 0.007M/s  (  0.138M/s/cpu)
uprobe-nop      (64 cpus):    6.887 ± 0.020M/s  (  0.108M/s/cpu)

uretprobe-nop   ( 1 cpus):    2.174 ± 0.001M/s  (  2.174M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.853 ± 0.001M/s  (  1.426M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.913 ± 0.002M/s  (  1.228M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.883 ± 0.002M/s  (  0.735M/s/cpu)
uretprobe-nop   (16 cpus):    5.147 ± 0.001M/s  (  0.322M/s/cpu)
uretprobe-nop   (32 cpus):    3.738 ± 0.008M/s  (  0.117M/s/cpu)
uretprobe-nop   (64 cpus):    4.397 ± 0.002M/s  (  0.069M/s/cpu)

Peak throughput for uprobes increases from 8 mln/s to 10.3 mln/s
(+28%!), and for uretprobes from 5.3 mln/s to 5.8 mln/s (+11%), as we
have more work to do on uretprobes side.

Even single-thread (no contention) performance is slightly better: 3.276
mln/s to 3.396 mln/s (+3.7%) for uprobes, and 2.055 mln/s to 2.174 mln/s
(+5.8%) for uretprobes.

We also select TASKS_TRACE_RCU for UPROBES in Kconfig due to this
dependency.

Reviewed-by: Masami Hiramatsu (Google)
Reviewed-by: Oleg Nesterov
Signed-off-by: Andrii Nakryiko
---
 arch/Kconfig            |  1 +
 kernel/events/uprobes.c | 38 ++++++++++++++++----------------------
 2 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..a0df3f3dc484 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -126,6 +126,7 @@ config KPROBES_ON_FTRACE
 config UPROBES
 	def_bool n
 	depends on ARCH_SUPPORTS_UPROBES
+	select TASKS_TRACE_RCU
 	help
 	  Uprobes is the user-space counterpart to kprobes: they
 	  enable instrumentation applications (such as 'perf probe')
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4b7e590dc428..a2e6a57f79f2 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -42,8 +43,6 @@ static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
 static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
-DEFINE_STATIC_SRCU(uprobes_srcu);
-
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
 static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
@@ -652,7 +651,7 @@ static void put_uprobe(struct uprobe *uprobe)
 	delayed_uprobe_remove(uprobe, NULL);
 	mutex_unlock(&delayed_uprobe_lock);
 
-	call_srcu(&uprobes_srcu, &uprobe->rcu, uprobe_free_rcu);
+	call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
 }
 
 static __always_inline
@@ -707,7 +706,7 @@ static struct uprobe *find_uprobe_rcu(struct inode *inode, loff_t offset)
 	struct rb_node *node;
 	unsigned int seq;
 
-	lockdep_assert(srcu_read_lock_held(&uprobes_srcu));
+	lockdep_assert(rcu_read_lock_trace_held());
 
 	do {
 		seq = read_seqcount_begin(&uprobes_seqcount);
@@ -935,8 +934,7 @@ static bool filter_chain(struct uprobe *uprobe, struct mm_struct *mm)
 	bool ret = false;
 
 	down_read(&uprobe->consumer_rwsem);
-	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		ret = consumer_filter(uc, mm);
 		if (ret)
 			break;
@@ -1157,7 +1155,7 @@ void uprobe_unregister_sync(void)
 	 * unlucky enough caller can free consumer's memory and cause
 	 * handler_chain() or handle_uretprobe_chain() to do an use-after-free.
 	 */
-	synchronize_srcu(&uprobes_srcu);
+	synchronize_rcu_tasks_trace();
 }
 EXPORT_SYMBOL_GPL(uprobe_unregister_sync);
 
@@ -1241,19 +1239,18 @@ EXPORT_SYMBOL_GPL(uprobe_register);
 int uprobe_apply(struct uprobe *uprobe, struct uprobe_consumer *uc, bool add)
 {
 	struct uprobe_consumer *con;
-	int ret = -ENOENT, srcu_idx;
+	int ret = -ENOENT;
 
 	down_write(&uprobe->register_rwsem);
 
-	srcu_idx = srcu_read_lock(&uprobes_srcu);
-	list_for_each_entry_srcu(con, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	rcu_read_lock_trace();
+	list_for_each_entry_rcu(con, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		if (con == uc) {
 			ret = register_for_each_vma(uprobe, add ? uc : NULL);
 			break;
 		}
 	}
-	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+	rcu_read_unlock_trace();
 
 	up_write(&uprobe->register_rwsem);
 
@@ -2123,8 +2120,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 
 	current->utask->auprobe = &uprobe->arch;
 
-	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		int rc = 0;
 
 		if (uc->handler) {
@@ -2162,15 +2158,13 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 {
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
-	int srcu_idx;
 
-	srcu_idx = srcu_read_lock(&uprobes_srcu);
-	list_for_each_entry_srcu(uc, &uprobe->consumers, cons_node,
-				 srcu_read_lock_held(&uprobes_srcu)) {
+	rcu_read_lock_trace();
+	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		if (uc->ret_handler)
 			uc->ret_handler(uc, ri->func, regs);
 	}
-	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+	rcu_read_unlock_trace();
 }
 
 static struct return_instance *find_next_ret_chain(struct return_instance *ri)
@@ -2255,13 +2249,13 @@ static void handle_swbp(struct pt_regs *regs)
 {
 	struct uprobe *uprobe;
 	unsigned long bp_vaddr;
-	int is_swbp, srcu_idx;
+	int is_swbp;
 
 	bp_vaddr = uprobe_get_swbp_addr(regs);
 	if (bp_vaddr == uprobe_get_trampoline_vaddr())
 		return uprobe_handle_trampoline(regs);
 
-	srcu_idx = srcu_read_lock(&uprobes_srcu);
+	rcu_read_lock_trace();
 
 	uprobe = find_active_uprobe_rcu(bp_vaddr, &is_swbp);
 	if (!uprobe) {
@@ -2319,7 +2313,7 @@ static void handle_swbp(struct pt_regs *regs)
 
 out:
 	/* arch_uprobe_skip_sstep() succeeded, or restart if can't singlestep */
-	srcu_read_unlock(&uprobes_srcu, srcu_idx);
+	rcu_read_unlock_trace();
 }
 
 /*
-- 
2.43.5
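
The percentage gains quoted in the commit message can be rechecked from
the benchmark tables. A minimal sketch in plain userspace C follows;
pct_gain() is a hypothetical helper introduced only for this
illustration and is not part of the patch or of the kernel.

```c
/*
 * Relative change from 'before' to 'after', in percent.
 * Hypothetical helper for illustration only; not part of the patch.
 * Inputs are throughputs in M/s taken from the benchmark tables.
 */
double pct_gain(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, pct_gain(8.097, 10.355) evaluates the 8-CPU uprobe-nop row
(SRCU vs RCU Tasks Trace) and comes out at roughly 27.9, which matches
the "+28%" peak figure for uprobes.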