From nobody Tue Apr 7 00:44:21 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8C909ECAAD3 for ; Thu, 1 Sep 2022 08:13:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233907AbiIAINF (ORCPT ); Thu, 1 Sep 2022 04:13:05 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45302 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233775AbiIAIMq (ORCPT ); Thu, 1 Sep 2022 04:12:46 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 04F9B1299E6; Thu, 1 Sep 2022 01:12:45 -0700 (PDT) Date: Thu, 01 Sep 2022 08:12:42 -0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1662019963; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5ep2/2GVbcTjT95gVbliM4ZBzWW96TkKTOFbBnvqq8E=; b=yLuc7VnoqdDaE8/Vshpd6f+j39qsN3Qrhsr5GuFLIWES7N+mwI4+y94RP4/swQFI9r2k73 pDgwKTpGnQSTlElC63m1S81QpqBTiXy5K35KxvSH4VmZE/aCcRFg6YgPjRfbMR7Okxt3K1 XBpMeXo9oKoStQlYX5uscBufGMi4MU+GWFNKNajyX+BYT+W0QkV/c3otyVtA1fpOUdBtn1 mjp2XYHTIrk6cKwfM01UXXBTWOoN1Tdq/OtuT7NyrcLQEp/I0uH4KRO0ffW8Mj8ih+f6xz 5GpZiIeUvtrZG1MWNn2e0vjpkpCdgEpPh4btEeIPOyUvyFCKLHY4NVOTRwKpTw== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1662019963; h=from:from:sender:sender:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5ep2/2GVbcTjT95gVbliM4ZBzWW96TkKTOFbBnvqq8E=; b=3ndnVqCTgKQgdbFgicJonj6KBO8LuEc6t/mrDOkWEXxY3dUAI6dXmzmdA7L+VlNCG1GvI0 sjXN+8W3VC3931CQ== From: "tip-bot2 for Marco Elver" Sender: tip-bot2@linutronix.de Reply-to: linux-kernel@vger.kernel.org To: linux-tip-commits@vger.kernel.org Subject: [tip: perf/core] perf/hw_breakpoint: Reduce contention with large number of tasks Cc: Marco Elver , "Peter Zijlstra (Intel)" , Dmitry Vyukov , Ian Rogers , x86@kernel.org, linux-kernel@vger.kernel.org In-Reply-To: <20220829124719.675715-12-elver@google.com> References: <20220829124719.675715-12-elver@google.com> MIME-Version: 1.0 Message-ID: <166201996238.401.6409571475329516346.tip-bot2@tip-bot2> Robot-ID: Robot-Unsubscribe: Contact to get blacklisted from these emails Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The following commit has been merged into the perf/core branch of tip: Commit-ID: 0912037fec1136d4e4796a3481f4a4ee09a2c325 Gitweb: https://git.kernel.org/tip/0912037fec1136d4e4796a3481f4a4ee0= 9a2c325 Author: Marco Elver AuthorDate: Mon, 29 Aug 2022 14:47:16 +02:00 Committer: Peter Zijlstra CommitterDate: Tue, 30 Aug 2022 10:56:24 +02:00 perf/hw_breakpoint: Reduce contention with large number of tasks While optimizing task_bp_pinned()'s runtime complexity to O(1) on average helps reduce time spent in the critical section, we still suffer due to serializing everything via 'nr_bp_mutex'. Indeed, a profile shows that now contention is the biggest issue: 95.93% [kernel] [k] osq_lock 0.70% [kernel] [k] mutex_spin_on_owner 0.22% [kernel] [k] smp_cfm_core_cond 0.18% [kernel] [k] task_bp_pinned 0.18% [kernel] [k] rhashtable_jhash2 0.15% [kernel] [k] queued_spin_lock_slowpath when running the breakpoint benchmark with (system with 256 CPUs): | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.207 [sec] | | 108.267188 usecs/op | 6929.100000 usecs/op/cpu The main concern for synchronizing the breakpoint constraints data is that a consistent snapshot of the per-CPU and per-task data is observed. The access pattern is as follows: 1. If the target is a task: the task's pinned breakpoints are counted, checked for space, and then appended to; only bp_cpuinfo::cpu_pinned is used to check for conflicts with CPU-only breakpoints; bp_cpuinfo::tsk_pinned are incremented/decremented, but otherwise unused. 2. If the target is a CPU: bp_cpuinfo::cpu_pinned are counted, along with bp_cpuinfo::tsk_pinned; after a successful check, cpu_pinned is incremented. No per-task breakpoints are checked. Since rhltable safely synchronizes insertions/deletions, we can allow concurrency as follows: 1. If the target is a task: independent tasks may update and check the constraints concurrently, but same-task target calls need to be serialized; since bp_cpuinfo::tsk_pinned is only updated, but not checked, these modifications can happen concurrently by switching tsk_pinned to atomic_t. 2. If the target is a CPU: access to the per-CPU constraints needs to be serialized with other CPU-target and task-target callers (to stabilize the bp_cpuinfo::tsk_pinned snapshot). We can allow the above concurrency by introducing a per-CPU constraints data reader-writer lock (bp_cpuinfo_sem), and per-task mutexes (reuses task_struct::perf_event_mutex): 1. If the target is a task: acquires perf_event_mutex, and acquires bp_cpuinfo_sem as a reader. The choice of percpu-rwsem minimizes contention in the presence of many read-lock but few write-lock acquisitions: we assume many orders of magnitude more task target breakpoints creations/destructions than CPU target breakpoints. 2. If the target is a CPU: acquires bp_cpuinfo_sem as a writer. With these changes, contention with thousands of tasks is reduced to the point where waiting on locking no longer dominates the profile: | $> perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.077 [sec] | | 40.201563 usecs/op | 2572.900000 usecs/op/cpu 21.54% [kernel] [k] task_bp_pinned 20.18% [kernel] [k] rhashtable_jhash2 6.81% [kernel] [k] toggle_bp_slot 5.47% [kernel] [k] queued_spin_lock_slowpath 3.75% [kernel] [k] smp_cfm_core_cond 3.48% [kernel] [k] bcmp On this particular setup that's a speedup of 2.7x. We're also getting closer to the theoretical ideal performance through optimizations in hw_breakpoint.c -- constraints accounting disabled: | perf bench -r 30 breakpoint thread -b 4 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 4 breakpoints and 64 parallelism | Total time: 0.067 [sec] | | 35.286458 usecs/op | 2258.333333 usecs/op/cpu Which means the current implementation is ~12% slower than the theoretical ideal. For reference, performance without any breakpoints: | $> bench -r 30 breakpoint thread -b 0 -p 64 -t 64 | # Running 'breakpoint/thread' benchmark: | # Created/joined 30 threads with 0 breakpoints and 64 parallelism | Total time: 0.060 [sec] | | 31.365625 usecs/op | 2007.400000 usecs/op/cpu On a system with 256 CPUs, the theoretical ideal is only ~12% slower than no breakpoints at all; the current implementation is ~28% slower. Signed-off-by: Marco Elver Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dmitry Vyukov Acked-by: Ian Rogers Link: https://lore.kernel.org/r/20220829124719.675715-12-elver@google.com --- kernel/events/hw_breakpoint.c | 161 +++++++++++++++++++++++++++------ 1 file changed, 133 insertions(+), 28 deletions(-) diff --git a/kernel/events/hw_breakpoint.c b/kernel/events/hw_breakpoint.c index 8b40fca..229c6f4 100644 --- a/kernel/events/hw_breakpoint.c +++ b/kernel/events/hw_breakpoint.c @@ -19,6 +19,7 @@ =20 #include =20 +#include #include #include #include @@ -28,6 +29,7 @@ #include #include #include +#include #include #include #include @@ -41,9 +43,9 @@ struct bp_cpuinfo { unsigned int cpu_pinned; /* tsk_pinned[n] is the number of tasks having n+1 breakpoints */ #ifdef hw_breakpoint_slots - unsigned int tsk_pinned[hw_breakpoint_slots(0)]; + atomic_t tsk_pinned[hw_breakpoint_slots(0)]; #else - unsigned int *tsk_pinned; + atomic_t *tsk_pinned; #endif }; =20 @@ -65,8 +67,79 @@ static const struct rhashtable_params task_bps_ht_params= =3D { =20 static bool constraints_initialized __ro_after_init; =20 -/* Serialize accesses to the above constraints */ -static DEFINE_MUTEX(nr_bp_mutex); +/* + * Synchronizes accesses to the per-CPU constraints; the locking rules are: + * + * 1. Atomic updates to bp_cpuinfo::tsk_pinned only require a held read-l= ock + * (due to bp_slots_histogram::count being atomic, no update are lost). + * + * 2. Holding a write-lock is required for computations that require a + * stable snapshot of all bp_cpuinfo::tsk_pinned. + * + * 3. In all other cases, non-atomic accesses require the appropriately h= eld + * lock (read-lock for read-only accesses; write-lock for reads/writes= ). + */ +DEFINE_STATIC_PERCPU_RWSEM(bp_cpuinfo_sem); + +/* + * Return mutex to serialize accesses to per-task lists in task_bps_ht. Si= nce + * rhltable synchronizes concurrent insertions/deletions, independent task= s may + * insert/delete concurrently; therefore, a mutex per task is sufficient. + * + * Uses task_struct::perf_event_mutex, to avoid extending task_struct with= a + * hw_breakpoint-only mutex, which may be infrequently used. The caveat he= re is + * that hw_breakpoint may contend with per-task perf event list management= . The + * assumption is that perf usecases involving hw_breakpoints are very unli= kely + * to result in unnecessary contention. + */ +static inline struct mutex *get_task_bps_mutex(struct perf_event *bp) +{ + struct task_struct *tsk =3D bp->hw.target; + + return tsk ? &tsk->perf_event_mutex : NULL; +} + +static struct mutex *bp_constraints_lock(struct perf_event *bp) +{ + struct mutex *tsk_mtx =3D get_task_bps_mutex(bp); + + if (tsk_mtx) { + mutex_lock(tsk_mtx); + percpu_down_read(&bp_cpuinfo_sem); + } else { + percpu_down_write(&bp_cpuinfo_sem); + } + + return tsk_mtx; +} + +static void bp_constraints_unlock(struct mutex *tsk_mtx) +{ + if (tsk_mtx) { + percpu_up_read(&bp_cpuinfo_sem); + mutex_unlock(tsk_mtx); + } else { + percpu_up_write(&bp_cpuinfo_sem); + } +} + +static bool bp_constraints_is_locked(struct perf_event *bp) +{ + struct mutex *tsk_mtx =3D get_task_bps_mutex(bp); + + return percpu_is_write_locked(&bp_cpuinfo_sem) || + (tsk_mtx ? mutex_is_locked(tsk_mtx) : + percpu_is_read_locked(&bp_cpuinfo_sem)); +} + +static inline void assert_bp_constraints_lock_held(struct perf_event *bp) +{ + struct mutex *tsk_mtx =3D get_task_bps_mutex(bp); + + if (tsk_mtx) + lockdep_assert_held(tsk_mtx); + lockdep_assert_held(&bp_cpuinfo_sem); +} =20 #ifdef hw_breakpoint_slots /* @@ -97,7 +170,7 @@ static __init int init_breakpoint_slots(void) for (i =3D 0; i < TYPE_MAX; i++) { struct bp_cpuinfo *info =3D get_bp_info(cpu, i); =20 - info->tsk_pinned =3D kcalloc(__nr_bp_slots[i], sizeof(int), GFP_KERNEL); + info->tsk_pinned =3D kcalloc(__nr_bp_slots[i], sizeof(atomic_t), GFP_KE= RNEL); if (!info->tsk_pinned) goto err; } @@ -137,11 +210,19 @@ static inline enum bp_type_idx find_slot_idx(u64 bp_t= ype) */ static unsigned int max_task_bp_pinned(int cpu, enum bp_type_idx type) { - unsigned int *tsk_pinned =3D get_bp_info(cpu, type)->tsk_pinned; + atomic_t *tsk_pinned =3D get_bp_info(cpu, type)->tsk_pinned; int i; =20 + /* + * At this point we want to have acquired the bp_cpuinfo_sem as a + * writer to ensure that there are no concurrent writers in + * toggle_bp_task_slot() to tsk_pinned, and we get a stable snapshot. + */ + lockdep_assert_held_write(&bp_cpuinfo_sem); + for (i =3D hw_breakpoint_slots_cached(type) - 1; i >=3D 0; i--) { - if (tsk_pinned[i] > 0) + ASSERT_EXCLUSIVE_WRITER(tsk_pinned[i]); /* Catch unexpected writers. */ + if (atomic_read(&tsk_pinned[i]) > 0) return i + 1; } =20 @@ -158,6 +239,11 @@ static int task_bp_pinned(int cpu, struct perf_event *= bp, enum bp_type_idx type) struct perf_event *iter; int count =3D 0; =20 + /* + * We need a stable snapshot of the per-task breakpoint list. + */ + assert_bp_constraints_lock_held(bp); + rcu_read_lock(); head =3D rhltable_lookup(&task_bps_ht, &bp->hw.target, task_bps_ht_params= ); if (!head) @@ -214,16 +300,25 @@ max_bp_pinned_slots(struct perf_event *bp, enum bp_ty= pe_idx type) static void toggle_bp_task_slot(struct perf_event *bp, int cpu, enum bp_type_idx type, int weight) { - unsigned int *tsk_pinned =3D get_bp_info(cpu, type)->tsk_pinned; + atomic_t *tsk_pinned =3D get_bp_info(cpu, type)->tsk_pinned; int old_idx, new_idx; =20 + /* + * If bp->hw.target, tsk_pinned is only modified, but not used + * otherwise. We can permit concurrent updates as long as there are no + * other uses: having acquired bp_cpuinfo_sem as a reader allows + * concurrent updates here. Uses of tsk_pinned will require acquiring + * bp_cpuinfo_sem as a writer to stabilize tsk_pinned's value. + */ + lockdep_assert_held_read(&bp_cpuinfo_sem); + old_idx =3D task_bp_pinned(cpu, bp, type) - 1; new_idx =3D old_idx + weight; =20 if (old_idx >=3D 0) - tsk_pinned[old_idx]--; + atomic_dec(&tsk_pinned[old_idx]); if (new_idx >=3D 0) - tsk_pinned[new_idx]++; + atomic_inc(&tsk_pinned[new_idx]); } =20 /* @@ -241,6 +336,7 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enum= bp_type_idx type, =20 /* Pinned counter cpu profiling */ if (!bp->hw.target) { + lockdep_assert_held_write(&bp_cpuinfo_sem); get_bp_info(bp->cpu, type)->cpu_pinned +=3D weight; return 0; } @@ -249,6 +345,11 @@ toggle_bp_slot(struct perf_event *bp, bool enable, enu= m bp_type_idx type, for_each_cpu(cpu, cpumask) toggle_bp_task_slot(bp, cpu, type, weight); =20 + /* + * Readers want a stable snapshot of the per-task breakpoint list. + */ + assert_bp_constraints_lock_held(bp); + if (enable) return rhltable_insert(&task_bps_ht, &bp->hw.bp_list, task_bps_ht_params= ); else @@ -354,14 +455,10 @@ static int __reserve_bp_slot(struct perf_event *bp, u= 64 bp_type) =20 int reserve_bp_slot(struct perf_event *bp) { - int ret; - - mutex_lock(&nr_bp_mutex); - - ret =3D __reserve_bp_slot(bp, bp->attr.bp_type); - - mutex_unlock(&nr_bp_mutex); + struct mutex *mtx =3D bp_constraints_lock(bp); + int ret =3D __reserve_bp_slot(bp, bp->attr.bp_type); =20 + bp_constraints_unlock(mtx); return ret; } =20 @@ -379,12 +476,11 @@ static void __release_bp_slot(struct perf_event *bp, = u64 bp_type) =20 void release_bp_slot(struct perf_event *bp) { - mutex_lock(&nr_bp_mutex); + struct mutex *mtx =3D bp_constraints_lock(bp); =20 arch_unregister_hw_breakpoint(bp); __release_bp_slot(bp, bp->attr.bp_type); - - mutex_unlock(&nr_bp_mutex); + bp_constraints_unlock(mtx); } =20 static int __modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_t= ype) @@ -411,11 +507,10 @@ static int __modify_bp_slot(struct perf_event *bp, u6= 4 old_type, u64 new_type) =20 static int modify_bp_slot(struct perf_event *bp, u64 old_type, u64 new_typ= e) { - int ret; + struct mutex *mtx =3D bp_constraints_lock(bp); + int ret =3D __modify_bp_slot(bp, old_type, new_type); =20 - mutex_lock(&nr_bp_mutex); - ret =3D __modify_bp_slot(bp, old_type, new_type); - mutex_unlock(&nr_bp_mutex); + bp_constraints_unlock(mtx); return ret; } =20 @@ -426,18 +521,28 @@ static int modify_bp_slot(struct perf_event *bp, u64 = old_type, u64 new_type) */ int dbg_reserve_bp_slot(struct perf_event *bp) { - if (mutex_is_locked(&nr_bp_mutex)) + int ret; + + if (bp_constraints_is_locked(bp)) return -1; =20 - return __reserve_bp_slot(bp, bp->attr.bp_type); + /* Locks aren't held; disable lockdep assert checking. */ + lockdep_off(); + ret =3D __reserve_bp_slot(bp, bp->attr.bp_type); + lockdep_on(); + + return ret; } =20 int dbg_release_bp_slot(struct perf_event *bp) { - if (mutex_is_locked(&nr_bp_mutex)) + if (bp_constraints_is_locked(bp)) return -1; =20 + /* Locks aren't held; disable lockdep assert checking. */ + lockdep_off(); __release_bp_slot(bp, bp->attr.bp_type); + lockdep_on(); =20 return 0; } @@ -663,7 +768,7 @@ bool hw_breakpoint_is_used(void) return true; =20 for (int slot =3D 0; slot < hw_breakpoint_slots_cached(type); ++slot) { - if (info->tsk_pinned[slot]) + if (atomic_read(&info->tsk_pinned[slot])) return true; } }