From: Ingo Molnar
To: linux-kernel@vger.kernel.org
Cc: Juergen Gross, "H. Peter Anvin", Linus Torvalds, Peter Zijlstra,
    Borislav Petkov, Thomas Gleixner, Eric Dumazet, Ingo Molnar,
    Brian Gerst, Kees Cook, Josh Poimboeuf
Subject: [PATCH 01/53] x86/alternatives: Improve code-patching scalability by removing false sharing in poke_int3_handler()
Date: Fri, 11 Apr 2025 07:40:13 +0200
Message-ID: <20250411054105.2341982-2-mingo@kernel.org>
In-Reply-To: <20250411054105.2341982-1-mingo@kernel.org>
References: <20250411054105.2341982-1-mingo@kernel.org>

From: Eric Dumazet

eBPF programs can be run 50,000,000 times per second on busy servers.

Whenever /proc/sys/kernel/bpf_stats_enabled is turned off, hundreds of
call sites are patched from text_poke_bp_batch() and we see a huge loss
of performance due to false sharing on bp_desc.refs, lasting up to
three seconds:

    51.30%  server_bin  [kernel.kallsyms]  [k] poke_int3_handler
            |
            |--46.45%--poke_int3_handler
            |          exc_int3
            |          asm_exc_int3
            |          |
            |          |--24.26%--cls_bpf_classify
            |          |          tcf_classify
            |          |          __dev_queue_xmit
            |          |          ip6_finish_output2
            |          |          ip6_output
            |          |          ip6_xmit
            |          |          inet6_csk_xmit
            |          |          __tcp_transmit_skb

Fix this by replacing bp_desc.refs with a per-cpu bp_refs.
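
As a rough illustration of the effect being fixed (this sketch is not part
of the patch, and all names in it are made up for the example): a single
shared atomic refcount forces every CPU that takes the INT3 trap to bounce
the same cache line, whereas per-CPU counters keep each CPU on its own
line, which is what the per-cpu bp_refs below achieves. A minimal
userspace analogue, assuming a 64-byte cache line:

  /* false_sharing_demo.c - build with: gcc -O2 -pthread false_sharing_demo.c */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>
  #include <time.h>

  #define NTHREADS	8
  #define ITERS		(1L << 22)

  /* Case 1: one shared counter - every inc/dec bounces a single cache line. */
  static atomic_int shared_ref;

  /* Case 2: one counter per thread, each on its own 64-byte cache line. */
  struct padded_ref {
  	_Alignas(64) atomic_int ref;
  };
  static struct padded_ref per_thread_ref[NTHREADS];

  static void *hammer_shared(void *arg)
  {
  	(void)arg;
  	for (long i = 0; i < ITERS; i++) {
  		atomic_fetch_add(&shared_ref, 1);
  		atomic_fetch_sub(&shared_ref, 1);
  	}
  	return NULL;
  }

  static void *hammer_per_thread(void *arg)
  {
  	atomic_int *ref = &per_thread_ref[(long)arg].ref;

  	for (long i = 0; i < ITERS; i++) {
  		atomic_fetch_add(ref, 1);
  		atomic_fetch_sub(ref, 1);
  	}
  	return NULL;
  }

  static double run(void *(*fn)(void *))
  {
  	pthread_t tid[NTHREADS];
  	struct timespec t0, t1;

  	clock_gettime(CLOCK_MONOTONIC, &t0);
  	for (long i = 0; i < NTHREADS; i++)
  		pthread_create(&tid[i], NULL, fn, (void *)i);
  	for (long i = 0; i < NTHREADS; i++)
  		pthread_join(tid[i], NULL);
  	clock_gettime(CLOCK_MONOTONIC, &t1);

  	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void)
  {
  	printf("shared refcount    : %.3f s\n", run(hammer_shared));
  	printf("per-thread refcount: %.3f s\n", run(hammer_per_thread));
  	return 0;
  }

The gap between the two cases grows with the number of threads contending
on the same line, which mirrors how pronounced the regression is on a
240-core host.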
Before the patch, on a host with 240 cores (480 threads):

  $ sysctl -wq kernel.bpf_stats_enabled=0
  text_poke_bp_batch(nr_entries=164) : Took 2655300 usec

  $ bpftool prog | grep run_time_ns
  ...
  105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns 3009063719 run_cnt 82757845 : average cost is 36 nsec per call

After this patch:

  $ sysctl -wq kernel.bpf_stats_enabled=0
  text_poke_bp_batch(nr_entries=164) : Took 702 usec

  $ bpftool prog | grep run_time_ns
  ...
  105: sched_cls  name hn_egress  tag 699fc5eea64144e3  gpl run_time_ns 1928223019 run_cnt 67682728 : average cost is 28 nsec per call

I.e. text-patching performance improved ~3700x: from 2.65 seconds to
0.0007 seconds.

Since the atomic_cond_read_acquire(refs, !VAL) spin-loop was not triggered
even once in my tests, add an unlikely() annotation: not having to spin
appears to be the common case.

[ mingo: Improved the changelog some more. ]
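
For reference, the per-call averages quoted above follow directly from the
bpftool counters, i.e. run_time_ns / run_cnt:

  before: 3009063719 ns / 82757845 calls = ~36 nsec per call
  after:  1928223019 ns / 67682728 calls = ~28 nsec per call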
Signed-off-by: Eric Dumazet
Signed-off-by: Ingo Molnar
Cc: Brian Gerst
Cc: Juergen Gross
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Kees Cook
Cc: Peter Zijlstra
Cc: Josh Poimboeuf
Link: https://lore.kernel.org/r/20250325043316.874518-1-edumazet@google.com
---
 arch/x86/kernel/alternative.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index bf82c6f7d690..85089c79a828 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -2474,28 +2474,29 @@ struct text_poke_loc {
 struct bp_patching_desc {
 	struct text_poke_loc *vec;
 	int nr_entries;
-	atomic_t refs;
 };
 
+static DEFINE_PER_CPU(atomic_t, bp_refs);
+
 static struct bp_patching_desc bp_desc;
 
 static __always_inline
 struct bp_patching_desc *try_get_desc(void)
 {
-	struct bp_patching_desc *desc = &bp_desc;
+	atomic_t *refs = this_cpu_ptr(&bp_refs);
 
-	if (!raw_atomic_inc_not_zero(&desc->refs))
+	if (!raw_atomic_inc_not_zero(refs))
 		return NULL;
 
-	return desc;
+	return &bp_desc;
 }
 
 static __always_inline void put_desc(void)
 {
-	struct bp_patching_desc *desc = &bp_desc;
+	atomic_t *refs = this_cpu_ptr(&bp_refs);
 
 	smp_mb__before_atomic();
-	raw_atomic_dec(&desc->refs);
+	raw_atomic_dec(refs);
 }
 
 static __always_inline void *text_poke_addr(struct text_poke_loc *tp)
@@ -2528,9 +2529,9 @@ noinstr int poke_int3_handler(struct pt_regs *regs)
 	 * Having observed our INT3 instruction, we now must observe
 	 * bp_desc with non-zero refcount:
 	 *
-	 *	bp_desc.refs = 1		INT3
-	 *	WMB				RMB
-	 *	write INT3			if (bp_desc.refs != 0)
+	 *	bp_refs = 1		INT3
+	 *	WMB			RMB
+	 *	write INT3		if (bp_refs != 0)
 	 */
 	smp_rmb();
 
@@ -2636,7 +2637,8 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	 * Corresponds to the implicit memory barrier in try_get_desc() to
 	 * ensure reading a non-zero refcount provides up to date bp_desc data.
 	 */
-	atomic_set_release(&bp_desc.refs, 1);
+	for_each_possible_cpu(i)
+		atomic_set_release(per_cpu_ptr(&bp_refs, i), 1);
 
 	/*
 	 * Function tracing can enable thousands of places that need to be
@@ -2750,8 +2752,12 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	/*
 	 * Remove and wait for refs to be zero.
 	 */
-	if (!atomic_dec_and_test(&bp_desc.refs))
-		atomic_cond_read_acquire(&bp_desc.refs, !VAL);
+	for_each_possible_cpu(i) {
+		atomic_t *refs = per_cpu_ptr(&bp_refs, i);
+
+		if (unlikely(!atomic_dec_and_test(refs)))
+			atomic_cond_read_acquire(refs, !VAL);
+	}
 }
 
 static void text_poke_loc_init(struct text_poke_loc *tp, void *addr,
-- 
2.45.2