From nobody Wed Sep 10 01:53:14 2025
Message-ID: <20250908171524.435994255@kernel.org>
User-Agent: quilt/0.68
Date: Mon, 08 Sep 2025 13:14:13 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 bpf@vger.kernel.org, x86@kernel.org
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra,
 Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
 Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, "Jose E. Marchesi",
 Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
 Florian Weimer, Sam James, Kees Cook, "Carlos O'Donell"
Subject: [RESEND][PATCH v15 1/4] unwind deferred: Add unwind_user_get_cookie() API
References: <20250908171412.268168931@kernel.org>

From: Steven Rostedt

Add the unwind_user_get_cookie() API, which allows a subsystem to
retrieve the current context cookie. Perf can use this to attach a
cookie to its own task deferred unwinding code, which doesn't use the
deferred unwind logic.

Signed-off-by: Steven Rostedt (Google)
---
 include/linux/unwind_deferred.h |  5 +++++
 kernel/unwind/deferred.c        | 21 +++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 26122d00708a..ce507495972c 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -41,6 +41,8 @@ void unwind_deferred_cancel(struct unwind_work *work);
 
 void unwind_deferred_task_exit(struct task_struct *task);
 
+u64 unwind_user_get_cookie(void);
+
 static __always_inline void unwind_reset_info(void)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -76,6 +78,9 @@ static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 static inline void unwind_deferred_task_exit(struct task_struct *task) {}
 static inline void unwind_reset_info(void) {}
 
+/* Must be non-zero */
+static inline u64 unwind_user_get_cookie(void) { return (u64)-1; }
+
 #endif /* !CONFIG_UNWIND_USER */
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_H */
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index dc6040aae3ee..90f90e30000a 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -94,6 +94,27 @@ static u64 get_cookie(struct unwind_task_info *info)
 	return info->id.id;
 }
 
+/**
+ * unwind_user_get_cookie - Get the current user context cookie
+ *
+ * This is used to get a unique context cookie for the current task.
+ * Every time a task enters the kernel it has a new context. If
+ * a subsystem needs to have a unique identifier for that context for
+ * the current task, it can call this function to retrieve a unique
+ * cookie for that task context.
+ *
+ * Returns: A unique identifier for the current task user context.
+ */
+u64 unwind_user_get_cookie(void)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+
+	guard(irqsave)();
+	/* Make sure to clear the info->id.id when exiting the kernel */
+	set_bit(UNWIND_USED_BIT, &info->unwind_mask);
+	return get_cookie(info);
+}
+
 /**
  * unwind_user_faultable - Produce a user stacktrace in faultable context
  * @trace: The descriptor that will store the user stacktrace
-- 
2.50.1

From nobody Wed Sep 10 01:53:14 2025
Message-ID: <20250908171524.605637238@kernel.org>
Date: Mon, 08 Sep 2025 13:14:14 -0400
From: Steven Rostedt
Subject: [RESEND][PATCH v15 2/4] perf: Support deferred user callchains
References: <20250908171412.268168931@kernel.org>

From: Josh Poimboeuf

If the user fault unwind is available (the one that will be used for
sframes), have perf be able to utilize it. Currently all user stack
traces are done at the request site. This mostly happens in interrupt or
NMI context, where user space is only accessible if it is currently
present in memory. The user stack may have been swapped out and not be
present, but more importantly, sframes will require faulting in user
pages, which is not possible from interrupt context. Instead, add a
framework that delays the reading of the user space stack until the task
goes back to user space, where faulting in pages is possible. This is
also advantageous because the user space stack doesn't change while the
task is in the kernel, and it removes duplicate entries of user space
stacks for a long running system call being profiled.

A new perf context called PERF_CONTEXT_USER_DEFERRED is created. It is
added to the kernel callchain, usually when an interrupt or NMI is
triggered (but it can be added to any callchain). When a deferred unwind
is required, a new task_work (pending_unwind_work) is queued on the task.
The callchain that is recorded immediately for the kernel has
PERF_CONTEXT_USER_DEFERRED appended to it.

When the task exits to user space and the task_work handler is
triggered, it executes the user stack unwinding and records the user
stack trace. This user stack trace goes into a new perf record type
called PERF_RECORD_CALLCHAIN_DEFERRED. The perf user space tool will need
to attach this stack trace to each of the previous kernel callchains for
that task that have the PERF_CONTEXT_USER_DEFERRED context in them.

The entries of struct unwind_stacktrace are "unsigned long", and they
are copied directly into struct perf_callchain_entry, whose "ip" field is
defined as u64. For this reason, deferred callchains are currently only
allowed on 64-bit architectures. This could change in the future if
there is demand for it on 32-bit architectures.

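As a rough illustration of the stitching that the perf user space tool is expected to do, the sketch below splices the ips[] of a PERF_RECORD_CALLCHAIN_DEFERRED record into an earlier kernel callchain at the PERF_CONTEXT_USER_DEFERRED marker. It is a minimal userspace sketch under stated assumptions, not the perf tool's code: the helper name stitch_callchain() and its buffer handling are hypothetical, and only the two context values are taken from the uapi header in this series.

```c
#include <stddef.h>
#include <stdint.h>

/* Context markers as defined in include/uapi/linux/perf_event.h. */
#define PERF_CONTEXT_USER           ((uint64_t)-512)
#define PERF_CONTEXT_USER_DEFERRED  ((uint64_t)-640)

/*
 * Hypothetical helper (not part of perf): build a full callchain from an
 * earlier sample's kernel callchain and the ips[] of a later
 * PERF_RECORD_CALLCHAIN_DEFERRED record.
 *
 * Entries before the PERF_CONTEXT_USER_DEFERRED marker are kept; the
 * marker and anything after it are replaced by the deferred ips[].
 *
 * Returns the number of entries written to @out, or 0 if @sample has no
 * deferred marker or @out is too small.
 */
static size_t stitch_callchain(const uint64_t *sample, size_t sample_nr,
			       const uint64_t *ips, size_t ips_nr,
			       uint64_t *out, size_t out_cap)
{
	size_t marker;

	/* Locate the marker that foretells the deferred record. */
	for (marker = 0; marker < sample_nr; marker++) {
		if (sample[marker] == PERF_CONTEXT_USER_DEFERRED)
			break;
	}
	if (marker == sample_nr || marker + ips_nr > out_cap)
		return 0;

	/* Kernel part of the callchain, up to but excluding the marker. */
	for (size_t i = 0; i < marker; i++)
		out[i] = sample[i];
	/* User part: ips[0] is PERF_CONTEXT_USER, then the user frames. */
	for (size_t i = 0; i < ips_nr; i++)
		out[marker + i] = ips[i];

	return marker + ips_nr;
}
```

A caller would apply this to every earlier sample of the same task that still carries the marker, passing the record's ips[] and nr fields as @ips and @ips_nr.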
Suggested-by: Peter Zijlstra
Co-developed-by: Steven Rostedt (Google)
Signed-off-by: Josh Poimboeuf
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/perf_event.h            |   7 +-
 include/uapi/linux/perf_event.h       |  20 +++-
 kernel/bpf/stackmap.c                 |   4 +-
 kernel/events/callchain.c             |  11 +-
 kernel/events/core.c                  | 156 +++++++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h |  20 +++-
 6 files changed, 210 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b99..1527afa952f7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -53,6 +53,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -880,6 +881,10 @@ struct perf_event {
 	struct callback_head pending_task;
 	unsigned int pending_work;
 
+	unsigned int pending_unwind_callback;
+	struct callback_head pending_unwind_work;
+	struct rcuwait pending_unwind_wait;
+
 	atomic_t event_limit;
 
 	/* address range filters */
@@ -1720,7 +1725,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
 				inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec : 1, /* event is removed from task on exec */
 				sigtrap : 1, /* send synchronous SIGTRAP on event */
-				__reserved_1 : 26;
+				defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1 : 25;
 
 	union {
 		__u32 wakeup_events; /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space. Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header header;
+	 *	u64 cookie;
+	 *	u64 nr;
+	 *	u64 ips[nr];
+	 *	struct sample_id sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
 	PERF_RECORD_MAX, /* non-ABI */
 };
 
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV = (__u64)-32,
 	PERF_CONTEXT_KERNEL = (__u64)-128,
 	PERF_CONTEXT_USER = (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
 
 	PERF_CONTEXT_GUEST = (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..339f7cbbcf36 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+				   false, false, false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   crosstask, false, false);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7a31fa..d0e0da66a164 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
+		if (defer_user) {
+			/*
+			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+			 * which can be stitched to this one.
+			 */
+			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			goto exit_put;
+		}
+
 		if (add_mark)
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28de3baff792..37e684edbc8a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5582,6 +5582,95 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_pending_unwind_sync(struct perf_event *event)
+{
+	might_sleep();
+
+	if (!event->pending_unwind_callback)
+		return;
+
+	/*
+	 * If the task is queued to the current task's queue, we
+	 * obviously can't wait for it to complete. Simply cancel it.
+	 */
+	if (task_work_cancel(current, &event->pending_unwind_work)) {
+		event->pending_unwind_callback = 0;
+		local_dec(&event->ctx->nr_no_switch_fast);
+		return;
+	}
+
+	/*
+	 * All accesses related to the event are within the same RCU section in
+	 * perf_event_callchain_deferred(). The RCU grace period before the
+	 * event is freed will make sure all those accesses are complete by then.
+	 */
+	rcuwait_wait_event(&event->pending_unwind_wait, !event->pending_unwind_callback, TASK_UNINTERRUPTIBLE);
+}
+
+struct perf_callchain_deferred_event {
+	struct perf_event_header header;
+	u64 cookie;
+	u64 nr;
+	u64 ips[];
+};
+
+static void perf_event_callchain_deferred(struct callback_head *work)
+{
+	struct perf_event *event = container_of(work, struct perf_event, pending_unwind_work);
+	struct perf_callchain_deferred_event deferred_event;
+	u64 callchain_context = PERF_CONTEXT_USER;
+	struct unwind_stacktrace trace;
+	struct perf_output_handle handle;
+	struct perf_sample_data data;
+	u64 nr;
+
+	if (!event->pending_unwind_callback)
+		return;
+
+	if (unwind_user_faultable(&trace) < 0)
+		goto out;
+
+	/*
+	 * All accesses to the event must belong to the same implicit RCU
+	 * read-side critical section as the ->pending_unwind_callback reset.
+	 * See comment in perf_pending_unwind_sync().
+	 */
+	guard(rcu)();
+
+	if (current->flags & (PF_KTHREAD | PF_USER_WORKER))
+		goto out;
+
+	nr = trace.nr + 1 ; /* '+1' == callchain_context */
+
+	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+	deferred_event.header.misc = PERF_RECORD_MISC_USER;
+	deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+	deferred_event.nr = nr;
+	deferred_event.cookie = unwind_user_get_cookie();
+
+	perf_event_header__init_id(&deferred_event.header, &data, event);
+
+	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+		goto out;
+
+	perf_output_put(&handle, deferred_event);
+	perf_output_put(&handle, callchain_context);
+	/* trace.entries[] are not guaranteed to be 64bit */
+	for (int i = 0; i < trace.nr; i++) {
+		u64 entry = trace.entries[i];
+		perf_output_put(&handle, entry);
+	}
+	perf_event__output_id_sample(event, &handle, &data);
+
+	perf_output_end(&handle);
+
+out:
+	event->pending_unwind_callback = 0;
+	local_dec(&event->ctx->nr_no_switch_fast);
+	rcuwait_wake_up(&event->pending_unwind_wait);
+}
+
 static void perf_free_addr_filters(struct perf_event *event);
 
 /* vs perf_event_alloc() error */
@@ -5649,6 +5738,7 @@ static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending_irq);
 	irq_work_sync(&event->pending_disable_irq);
+	perf_pending_unwind_sync(event);
 
 	unaccount_event(event);
 
@@ -8194,6 +8284,46 @@ static u64 perf_get_page_size(unsigned long addr)
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
+/*
+ * Returns:
+ *  > 0 : if already queued.
+ *    0 : if it performed the queuing
+ *  < 0 : if it did not get queued.
+ */
+static int deferred_request(struct perf_event *event)
+{
+	struct callback_head *work = &event->pending_unwind_work;
+	int pending;
+	int ret;
+
+	/* Only defer for task events */
+	if (!event->ctx->task)
+		return -EINVAL;
+
+	if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
+	    !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	guard(irqsave)();
+
+	/* callback already pending? */
+	pending = READ_ONCE(event->pending_unwind_callback);
+	if (pending)
+		return 1;
+
+	/* Claim the work unless an NMI just now swooped in to do so. */
+	if (!try_cmpxchg(&event->pending_unwind_callback, &pending, 1))
+		return 1;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, work, TWA_RESUME);
+	if (WARN_ON_ONCE(ret)) {
+		WRITE_ONCE(event->pending_unwind_callback, 0);
+		return ret;
+	}
+	return 0;
+}
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
@@ -8204,6 +8334,9 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	/* perf currently only supports deferred in 64bit */
+	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+			  event->attr.defer_callchain;
 
 	if (!current->mm)
 		user = false;
@@ -8211,8 +8344,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	/* Disallow cross-task callchains. */
+	if (event->ctx->task && event->ctx->task != current)
+		return &__empty_callchain;
+
+	if (defer_user) {
+		int ret = deferred_request(event);
+		if (!ret)
+			local_inc(&event->ctx->nr_no_switch_fast);
+		else if (ret < 0)
+			defer_user = false;
+	}
+
+	callchain = get_perf_callchain(regs, kernel, user, max_stack,
+				       crosstask, true, defer_user);
+
 	return callchain ?: &__empty_callchain;
 }
 
@@ -12882,6 +13028,8 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	event->pending_disable_irq = IRQ_WORK_INIT_HARD(perf_pending_disable);
 	init_task_work(&event->pending_task, perf_pending_task);
 
+	rcuwait_init(&event->pending_unwind_wait);
+
 	mutex_init(&event->mmap_mutex);
 	raw_spin_lock_init(&event->addr_filters.lock);
 
@@ -13050,6 +13198,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (err)
 		return ERR_PTR(err);
 
+	if (event->attr.defer_callchain)
+		init_task_work(&event->pending_unwind_work,
+			       perf_event_callchain_deferred);
+
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
 				inherit_thread : 1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec : 1, /* event is removed from task on exec */
 				sigtrap : 1, /* send synchronous SIGTRAP on event */
-				__reserved_1 : 26;
+				defer_callchain: 1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1 : 25;
 
 	union {
 		__u32 wakeup_events; /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID = 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space. Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header header;
+	 *	u64 cookie;
+	 *	u64 nr;
+	 *	u64 ips[nr];
+	 *	struct sample_id sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED = 22,
+
 	PERF_RECORD_MAX, /* non-ABI */
 };
 
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV = (__u64)-32,
 	PERF_CONTEXT_KERNEL = (__u64)-128,
 	PERF_CONTEXT_USER = (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED = (__u64)-640,
 
 	PERF_CONTEXT_GUEST = (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL = (__u64)-2176,
-- 
2.50.1

From nobody Wed Sep 10 01:53:14 2025
Message-ID: <20250908171524.779521748@kernel.org>
Date: Mon, 08 Sep 2025 13:14:15 -0400
From: Steven Rostedt
Subject: [RESEND][PATCH v15 3/4] perf: Have the deferred request record the user context cookie
References: <20250908171412.268168931@kernel.org>

From: Steven Rostedt

When a request for a deferred unwind is made, record the cookie
associated with the user context in the event that represents that
request. It is added after PERF_CONTEXT_USER_DEFERRED in the callchain.
That perf context is a marker of where to add the associated user space
stack trace in the callchain. Adding the cookie after that marker does
not affect the appending of the callchain, as the perf tool overwrites
it with the user space stack.

The cookie will be used to match the cookie that is saved when the
deferred callchain is recorded. The perf tool can use the cookie saved
at the request to know whether the callchain that was recorded when the
task went back to user space is for that event.

Consider the case where events are dropped after the request is made.
The callchain recorded when the task goes back to user space is dropped,
the task then comes back into the kernel and a new request is dropped as
well. Recording then starts again and a new callchain is recorded when
the task next goes back to user space. That callchain is not for the
initial request. Matching on the cookie prevents the tool from attaching
it to the initial request.

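The matching described above can be sketched from the tool's side as follows. This is a minimal illustration under stated assumptions, not the perf tool's implementation: the helper name deferred_cookie_matches() is hypothetical, and it assumes the requesting sample's callchain is available as a flat array of u64 entries with the cookie stored directly after the PERF_CONTEXT_USER_DEFERRED marker, as this patch stores it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Marker value from include/uapi/linux/perf_event.h in this series. */
#define PERF_CONTEXT_USER_DEFERRED  ((uint64_t)-640)

/*
 * Hypothetical helper (not part of perf): decide whether a deferred
 * callchain record, identified by @record_cookie, belongs to the sample
 * whose callchain is @chain.
 *
 * The request stores its cookie in the entry right after the
 * PERF_CONTEXT_USER_DEFERRED marker; the record only belongs to the
 * sample if both cookies are equal.
 */
static bool deferred_cookie_matches(const uint64_t *chain, size_t nr,
				    uint64_t record_cookie)
{
	for (size_t i = 0; i + 1 < nr; i++) {
		if (chain[i] == PERF_CONTEXT_USER_DEFERRED)
			return chain[i + 1] == record_cookie;
	}
	/* No marker (or no room for a cookie): nothing to stitch. */
	return false;
}
```

When events are dropped as described, the surviving user stack trace carries the cookie of a later kernel entry, so the comparison fails for the first sample and the tool can leave that sample without a user stack rather than attach the wrong one.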
The cookie prevents:

  record kernel stack trace with PERF_CONTEXT_USER_DEFERRED
  [ dropped events start here ]
  record user stack trace - DROPPED
  [ enters user space ]
  [ exits user space back to the kernel ]
  record kernel stack trace with PERF_CONTEXT_USER_DEFERRED - DROPPED!
  [ events stop being dropped here ]
  record user stack trace

Without a differentiating "cookie" identifier, the user space tool would
incorrectly attach the last recorded user stack trace to the first
kernel stack trace with the PERF_CONTEXT_USER_DEFERRED, as the TID alone
is not enough to identify this situation.

Signed-off-by: Steven Rostedt (Google)
---
 include/linux/perf_event.h            |  2 +-
 include/uapi/linux/perf_event.h       |  5 +++++
 kernel/bpf/stackmap.c                 |  4 ++--
 kernel/events/callchain.c             |  9 ++++++---
 kernel/events/core.c                  | 11 +++++++----
 tools/include/uapi/linux/perf_event.h |  5 +++++
 6 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1527afa952f7..c8eefbc9ce51 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1725,7 +1725,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 20b8f890113b..79232e85a8fc 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -1282,6 +1282,11 @@ enum perf_bpf_event_type {
 #define PERF_MAX_STACK_DEPTH 127
 #define PERF_MAX_CONTEXTS_PER_STACK 8
 
+/*
+ * The PERF_CONTEXT_USER_DEFERRED has two items (context and cookie)
+ */
+#define PERF_DEFERRED_ITEMS 2
+
 enum perf_callchain_context {
 	PERF_CONTEXT_HV = (__u64)-32,
 	PERF_CONTEXT_KERNEL = (__u64)-128,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 339f7cbbcf36..ef6021111fe3 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false, false);
+				   false, false, 0);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false, false);
+					   crosstask, false, 0);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index d0e0da66a164..b9c7e00725d6 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
+		   u32 max_stack, bool crosstask, bool add_mark, u64 defer_cookie)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,12 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 			regs = task_pt_regs(current);
 		}
 
-		if (defer_user) {
+		if (defer_cookie) {
 			/*
 			 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
-			 * which can be stitched to this one.
+			 * which can be stitched to this one, and add
+			 * the cookie after it (it will be cut off when the
+			 * user stack is copied to the callchain).
 			 */
 			perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+			perf_callchain_store_context(&ctx, defer_cookie);
 			goto exit_put;
 		}
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 37e684edbc8a..db4ca7e4afb1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8290,7 +8290,7 @@ static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
  *    0 : if it performed the queuing
  *  < 0 : if it did not get queued.
  */
-static int deferred_request(struct perf_event *event)
+static int deferred_request(struct perf_event *event, u64 *defer_cookie)
 {
 	struct callback_head *work = &event->pending_unwind_work;
 	int pending;
@@ -8306,6 +8306,8 @@ static int deferred_request(struct perf_event *event)
 
 	guard(irqsave)();
 
+	*defer_cookie = unwind_user_get_cookie();
+
 	/* callback already pending? */
 	pending = READ_ONCE(event->pending_unwind_callback);
 	if (pending)
@@ -8334,6 +8336,7 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	u64 defer_cookie = 0;
 	/* perf currently only supports deferred in 64bit */
 	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
 			  event->attr.defer_callchain;
@@ -8349,15 +8352,15 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 		return &__empty_callchain;
 
 	if (defer_user) {
-		int ret = deferred_request(event);
+		int ret = deferred_request(event, &defer_cookie);
 		if (!ret)
 			local_inc(&event->ctx->nr_no_switch_fast);
 		else if (ret < 0)
-			defer_user = false;
+			defer_cookie = 0;
 	}
 
 	callchain = get_perf_callchain(regs, kernel, user, max_stack,
-				       crosstask, true, defer_user);
+				       crosstask, true, defer_cookie);
 
 	return callchain ?: &__empty_callchain;
 }
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 20b8f890113b..79232e85a8fc 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -1282,6 +1282,11 @@ enum perf_bpf_event_type {
 #define PERF_MAX_STACK_DEPTH 127
 #define PERF_MAX_CONTEXTS_PER_STACK 8
 
+/*
+ * The PERF_CONTEXT_USER_DEFERRED has two items (context and cookie)
+ */
+#define PERF_DEFERRED_ITEMS 2
+
 enum perf_callchain_context {
 	PERF_CONTEXT_HV = (__u64)-32,
 	PERF_CONTEXT_KERNEL = (__u64)-128,
-- 
2.50.1

From nobody Wed Sep 10 01:53:14 2025
Message-ID: <20250908171524.943453280@kernel.org>
Date: Mon, 08 Sep 2025 13:14:16 -0400
From: Steven Rostedt
Subject: [RESEND][PATCH v15 4/4] perf: Support deferred user callchains for per CPU events
References: <20250908171412.268168931@kernel.org>

From: Steven Rostedt

The deferred unwinder works fine for task events (events that trace only
a specific task): it can queue a task_work from an interrupt or NMI, and
when the task goes back to user space, the event's callback is called to
do the deferred unwinding.
But for per CPU events things are not so simple. When a per CPU event wants a deferred unwinding to occur, it cannot simply use a task_work, as there's a many to many relationship between CPUs and tasks. Suppose a task migrates off a CPU and another task is scheduled in, and the per CPU event wants a deferred unwinding of that task as well, while the event on the CPU the first task migrated to also wants to unwind it. Each CPU may then need unwinding from more than one task, and each task may have requests from many CPUs.

The main issue is that, from the kernel's point of view, there's currently nothing that associates a per CPU event for one CPU with the per CPU events that cover the other CPUs for a given process. To the kernel, they are all just individual event buffers. This is problematic if a deferred request is made on one CPU and the task migrates to another CPU where the deferred user stack trace will be performed. The kernel needs to know which CPU buffer, belonging to the same process that initiated the deferred request, to add the trace to.

To solve this, when a per CPU event is created with the defer_callchain attribute set, it does a lookup in a global list (unwind_deferred_list) for a perf_unwind_deferred descriptor whose id matches the PID of the current task's group_leader (the process ID shared by all the threads of a process). If one is not found, it creates one and adds it to the global list. This descriptor contains an array indexed by possible CPU, where each element is a perf_unwind_cpu descriptor.

The perf_unwind_cpu descriptor has a list of all the per CPU events that are tracing the CPU that corresponds to its index in the array, where the events belong to a task that has the same group_leader. It also has a processing bit and rcuwait to handle removal.

For each occupied perf_unwind_cpu descriptor in the array, the perf_unwind_deferred descriptor increments its nr_cpu_events. When a perf_unwind_cpu descriptor becomes empty, nr_cpu_events is decremented.
This is used to know when to free the perf_unwind_deferred descriptor, as when it becomes empty, it is no longer referenced. Finally, the perf_unwind_deferred descriptor has an id that holds the PID of the group_leader of the task that created the events.

When a second (or later) per CPU event is created and the perf_unwind_deferred descriptor already exists, the event just adds itself to the perf_unwind_cpu array of that descriptor, updating the necessary counter. This is used to map different per CPU events to each other based on their group leader PID.

Each of these perf_unwind_deferred descriptors has an unwind_work that registers with the deferred unwind infrastructure via unwind_deferred_init(), where it also registers perf_event_deferred_cpu() as its callback.

Now when a per CPU event requests a deferred unwinding, it calls unwind_deferred_request() with the unwind_work of the associated perf_unwind_deferred descriptor. It is expected that the program that uses this has events on all CPUs, as the deferred trace may not be recorded on the CPU event that requested it. That is, the task may migrate and its user stack trace will be recorded on the CPU event of the CPU that it exits back to user space on.
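The descriptor bookkeeping described above can be illustrated with a small userspace model. This is only a sketch and not the kernel code: the sketch_* names and SKETCH_NR_CPUS are hypothetical, a plain counter stands in for each per CPU event list, and the mutex, RCU and rcuwait handling of the real patch are left out.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_NR_CPUS 4	/* hypothetical; the patch sizes the array from the possible CPUs */

/* Stand-in for perf_unwind_cpu: a counter replaces the per CPU event list */
struct sketch_cpu {
	int nr_events;
};

/* Stand-in for perf_unwind_deferred */
struct sketch_deferred {
	struct sketch_deferred *next;
	struct sketch_cpu cpu_events[SKETCH_NR_CPUS];
	int nr_cpu_events;	/* number of non-empty cpu_events[] slots */
	int id;			/* PID of the group leader */
};

static struct sketch_deferred *deferred_list;

/* Attach an event for @cpu to the descriptor of @leader_pid, creating it if needed */
static struct sketch_deferred *sketch_add_event(int leader_pid, int cpu)
{
	struct sketch_deferred *d;

	if (cpu < 0 || cpu >= SKETCH_NR_CPUS)
		return NULL;

	for (d = deferred_list; d; d = d->next) {
		if (d->id == leader_pid)
			break;
	}

	if (!d) {
		d = calloc(1, sizeof(*d));
		if (!d)
			return NULL;
		d->id = leader_pid;
		d->next = deferred_list;
		deferred_list = d;
	}

	/* Only a slot going from empty to occupied bumps nr_cpu_events */
	if (d->cpu_events[cpu].nr_events++ == 0)
		d->nr_cpu_events++;
	return d;
}

/* Detach one event for @cpu; free the descriptor once every slot is empty */
static void sketch_remove_event(struct sketch_deferred *d, int cpu)
{
	struct sketch_deferred **p;

	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	assert(d->cpu_events[cpu].nr_events > 0);

	if (--d->cpu_events[cpu].nr_events > 0)
		return;
	if (--d->nr_cpu_events > 0)
		return;

	/* Last occupied slot went empty: unlink and free the descriptor */
	for (p = &deferred_list; *p; p = &(*p)->next) {
		if (*p == d) {
			*p = d->next;
			break;
		}
	}
	free(d);
}
```

In this model two events created by threads of the same process share one descriptor, a second event on an already occupied CPU leaves nr_cpu_events unchanged, and the descriptor is freed only when its last CPU slot becomes empty.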
Signed-off-by: Steven Rostedt (Google) --- include/linux/perf_event.h | 4 + kernel/events/core.c | 320 +++++++++++++++++++++++++++++++++---- 2 files changed, 295 insertions(+), 29 deletions(-) diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index c8eefbc9ce51..0edc7ad4c914 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -733,6 +733,7 @@ struct swevent_hlist { struct bpf_prog; struct perf_cgroup; struct perf_buffer; +struct perf_unwind_deferred; =20 struct pmu_event_list { raw_spinlock_t lock; @@ -885,6 +886,9 @@ struct perf_event { struct callback_head pending_unwind_work; struct rcuwait pending_unwind_wait; =20 + struct perf_unwind_deferred *unwind_deferred; + struct list_head unwind_list; + atomic_t event_limit; =20 /* address range filters */ diff --git a/kernel/events/core.c b/kernel/events/core.c index db4ca7e4afb1..303ab50eca8b 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -5582,10 +5582,193 @@ static bool exclusive_event_installable(struct per= f_event *event, return true; } =20 +/* Holds a list of per CPU events that registered for deferred unwinding */ +struct perf_unwind_cpu { + struct list_head list; + struct rcuwait pending_unwind_wait; + int processing; +}; + +struct perf_unwind_deferred { + struct list_head list; + struct unwind_work unwind_work; + struct perf_unwind_cpu __rcu *cpu_events; + struct rcu_head rcu_head; + int nr_cpu_events; + int id; +}; + +static DEFINE_MUTEX(unwind_deferred_mutex); +static LIST_HEAD(unwind_deferred_list); + +static void perf_event_deferred_cpu(struct unwind_work *work, + struct unwind_stacktrace *trace, u64 cookie); + +/* + * Add a per CPU event. + * + * The deferred callstack can happen on a different CPU than what was + * requested. If one CPU event requests a deferred callstack, but the + * task migrates, it will execute on a different CPU and save the + * stack trace to that CPU event. 
+ * + * In order to map all the CPU events with the same application, + * use the current->group_leader->pid as the identifier of what + * events share the same program. + * + * A perf_unwind_deferred descriptor is created for each unique + * group_leader pid, and all the events that have the same group_leader + * pid will be linked to the same deferred descriptor. + * + * If there's no descriptor that matches the current group_leader pid, + * one will be created. + */ +static int perf_add_unwind_deferred(struct perf_event *event) +{ + struct perf_unwind_deferred *defer; + struct perf_unwind_cpu *cpu_events; + int id =3D current->group_leader->pid; + bool found =3D false; + int ret =3D 0; + + if (event->cpu < 0) + return -EINVAL; + + guard(mutex)(&unwind_deferred_mutex); + + list_for_each_entry(defer, &unwind_deferred_list, list) { + if (defer->id =3D=3D id) { + found =3D true; + break; + } + } + + if (!found) { + defer =3D kzalloc(sizeof(*defer), GFP_KERNEL); + if (!defer) + return -ENOMEM; + list_add(&defer->list, &unwind_deferred_list); + defer->id =3D id; + } + + /* + * The deferred descriptor has an array for every CPU. + * Each entry in this array is a linked list of all the CPU + * events for the corresponding CPU. This is a quick way to + * find the associated event for a given CPU in + * perf_event_deferred_cpu(). 
+ */ + if (!defer->nr_cpu_events) { + cpu_events =3D kcalloc(num_possible_cpus(), + sizeof(*cpu_events), + GFP_KERNEL); + if (!cpu_events) { + ret =3D -ENOMEM; + goto free; + } + for (int cpu =3D 0; cpu < num_possible_cpus(); cpu++) { + rcuwait_init(&cpu_events[cpu].pending_unwind_wait); + INIT_LIST_HEAD(&cpu_events[cpu].list); + } + + rcu_assign_pointer(defer->cpu_events, cpu_events); + + ret =3D unwind_deferred_init(&defer->unwind_work, + perf_event_deferred_cpu); + if (ret) + goto free; + } + cpu_events =3D rcu_dereference_protected(defer->cpu_events, + lockdep_is_held(&unwind_deferred_mutex)); + + /* + * The defer->nr_cpu_events is the count of the number + * of non-empty lists in the cpu_events array. If the list + * being added to is already occupied, the nr_cpu_events does + * not need to get incremented. + */ + if (list_empty(&cpu_events[event->cpu].list)) + defer->nr_cpu_events++; + list_add_tail_rcu(&event->unwind_list, &cpu_events[event->cpu].list); + + event->unwind_deferred =3D defer; + return 0; +free: + /* Nothing to do if there was already an existing event attached */ + if (found) + return ret; + + list_del(&defer->list); + kfree(cpu_events); + kfree(defer); + return ret; +} + +static void free_unwind_deferred_rcu(struct rcu_head *head) +{ + struct perf_unwind_cpu *cpu_events; + struct perf_unwind_deferred *defer =3D + container_of(head, struct perf_unwind_deferred, rcu_head); + + WARN_ON_ONCE(defer->nr_cpu_events); + /* + * This is called by call_rcu() and there are no more + * references to cpu_events. 
+ */ + cpu_events =3D rcu_dereference_protected(defer->cpu_events, true); + kfree(cpu_events); + kfree(defer); +} + +static void perf_remove_unwind_deferred(struct perf_event *event) +{ + struct perf_unwind_deferred *defer =3D event->unwind_deferred; + struct perf_unwind_cpu *cpu_events, *cpu_unwind; + + if (!defer) + return; + + guard(mutex)(&unwind_deferred_mutex); + list_del_rcu(&event->unwind_list); + + cpu_events =3D rcu_dereference_protected(defer->cpu_events, + lockdep_is_held(&unwind_deferred_mutex)); + cpu_unwind =3D &cpu_events[event->cpu]; + + if (list_empty(&cpu_unwind->list)) { + defer->nr_cpu_events--; + if (!defer->nr_cpu_events) + unwind_deferred_cancel(&defer->unwind_work); + } + + event->unwind_deferred =3D NULL; + + /* + * Make sure perf_event_deferred_cpu() is done with this event. + * That function will set cpu_unwind->processing and then + * call smp_mb() before iterating the list of its events. + * If the event's unwind_deferred is NULL, it will be skipped. + * The smp_mb() in that function matches the mb() in + * rcuwait_wait_event(). + */ + rcuwait_wait_event(&cpu_unwind->pending_unwind_wait, + !cpu_unwind->processing, TASK_UNINTERRUPTIBLE); + + /* Is this still being used by other per CPU events? 
*/ + if (defer->nr_cpu_events) + return; + + list_del(&defer->list); + /* The defer->cpu_events is protected by RCU */ + call_rcu(&defer->rcu_head, free_unwind_deferred_rcu); +} + static void perf_pending_unwind_sync(struct perf_event *event) { might_sleep(); =20 + perf_remove_unwind_deferred(event); + if (!event->pending_unwind_callback) return; =20 @@ -5614,63 +5797,119 @@ struct perf_callchain_deferred_event { u64 ips[]; }; =20 -static void perf_event_callchain_deferred(struct callback_head *work) +static void perf_event_callchain_deferred(struct perf_event *event, + struct unwind_stacktrace *trace, + u64 cookie) { - struct perf_event *event =3D container_of(work, struct perf_event, pendin= g_unwind_work); struct perf_callchain_deferred_event deferred_event; u64 callchain_context =3D PERF_CONTEXT_USER; - struct unwind_stacktrace trace; struct perf_output_handle handle; struct perf_sample_data data; u64 nr; =20 - if (!event->pending_unwind_callback) - return; - - if (unwind_user_faultable(&trace) < 0) - goto out; - - /* - * All accesses to the event must belong to the same implicit RCU - * read-side critical section as the ->pending_unwind_callback reset. - * See comment in perf_pending_unwind_sync(). 
- */ - guard(rcu)(); - if (current->flags & (PF_KTHREAD | PF_USER_WORKER)) - goto out; + return; =20 - nr =3D trace.nr + 1 ; /* '+1' =3D=3D callchain_context */ + nr =3D trace->nr + 1 ; /* '+1' =3D=3D callchain_context */ =20 deferred_event.header.type =3D PERF_RECORD_CALLCHAIN_DEFERRED; deferred_event.header.misc =3D PERF_RECORD_MISC_USER; deferred_event.header.size =3D sizeof(deferred_event) + (nr * sizeof(u64)= ); =20 deferred_event.nr =3D nr; - deferred_event.cookie =3D unwind_user_get_cookie(); + deferred_event.cookie =3D cookie; =20 perf_event_header__init_id(&deferred_event.header, &data, event); =20 if (perf_output_begin(&handle, &data, event, deferred_event.header.size)) - goto out; + return; =20 perf_output_put(&handle, deferred_event); perf_output_put(&handle, callchain_context); - /* trace.entries[] are not guaranteed to be 64bit */ - for (int i =3D 0; i < trace.nr; i++) { - u64 entry =3D trace.entries[i]; + /* trace->entries[] are not guaranteed to be 64bit */ + for (int i =3D 0; i < trace->nr; i++) { + u64 entry =3D trace->entries[i]; perf_output_put(&handle, entry); } perf_event__output_id_sample(event, &handle, &data); =20 perf_output_end(&handle); +} + +/* Deferred unwinding callback for task specific events */ +static void perf_event_deferred_task(struct callback_head *work) +{ + struct perf_event *event =3D container_of(work, struct perf_event, pendin= g_unwind_work); + struct unwind_stacktrace trace; + + if (!event->pending_unwind_callback) + return; + + if (unwind_user_faultable(&trace) >=3D 0) { + u64 cookie =3D unwind_user_get_cookie(); + + /* + * All accesses to the event must belong to the same implicit RCU + * read-side critical section as the ->pending_unwind_callback reset. + * See comment in perf_pending_unwind_sync(). 
+ */ + guard(rcu)(); + perf_event_callchain_deferred(event, &trace, cookie); + } =20 -out: event->pending_unwind_callback =3D 0; local_dec(&event->ctx->nr_no_switch_fast); rcuwait_wake_up(&event->pending_unwind_wait); } =20 +/* + * Deferred unwinding callback for per CPU events. + * Note, the request for the deferred unwinding may have happened + * on a different CPU. + */ +static void perf_event_deferred_cpu(struct unwind_work *work, + struct unwind_stacktrace *trace, u64 cookie) +{ + struct perf_unwind_deferred *defer =3D + container_of(work, struct perf_unwind_deferred, unwind_work); + struct perf_unwind_cpu *cpu_events, *cpu_unwind; + struct perf_event *event; + int cpu; + + guard(rcu)(); + guard(preempt)(); + + cpu =3D smp_processor_id(); + cpu_events =3D rcu_dereference(defer->cpu_events); + cpu_unwind =3D &cpu_events[cpu]; + + WRITE_ONCE(cpu_unwind->processing, 1); + /* + * Make sure the above is seen before the event->unwind_deferred + * is checked. This matches the mb() in rcuwait_wait_event() in + * perf_remove_unwind_deferred(). + */ + smp_mb(); + + list_for_each_entry_rcu(event, &cpu_unwind->list, unwind_list) { + /* If unwind_deferred is NULL the event is going away */ + if (unlikely(!event->unwind_deferred)) + continue; + perf_event_callchain_deferred(event, trace, cookie); + /* Only the first CPU event gets the trace */ + break; + } + + /* + * The perf_event_callchain_deferred() must finish before setting + * cpu_unwind->processing to zero. This is also to synchronize + * with the rcuwait in perf_remove_unwind_deferred(). 
+ */ + smp_mb(); + WRITE_ONCE(cpu_unwind->processing, 0); + rcuwait_wake_up(&cpu_unwind->pending_unwind_wait); +} + static void perf_free_addr_filters(struct perf_event *event); =20 /* vs perf_event_alloc() error */ @@ -8284,6 +8523,17 @@ static u64 perf_get_page_size(unsigned long addr) =20 static struct perf_callchain_entry __empty_callchain =3D { .nr =3D 0, }; =20 + +static int deferred_unwind_request(struct perf_unwind_deferred *defer, + u64 *defer_cookie) +{ + /* + * Returns 0 for queued, 1 for already queued or executed, + * and negative on error. + */ + return unwind_deferred_request(&defer->unwind_work, defer_cookie); +} + /* * Returns: * > 0 : if already queued. @@ -8293,17 +8543,22 @@ static struct perf_callchain_entry __empty_callchai= n =3D { .nr =3D 0, }; static int deferred_request(struct perf_event *event, u64 *defer_cookie) { struct callback_head *work =3D &event->pending_unwind_work; + struct perf_unwind_deferred *defer; int pending; int ret; =20 - /* Only defer for task events */ - if (!event->ctx->task) - return -EINVAL; - if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) || !user_mode(task_pt_regs(current))) return -EINVAL; =20 + defer =3D READ_ONCE(event->unwind_deferred); + if (defer) + return deferred_unwind_request(defer, defer_cookie); + + /* Per CPU events should have had unwind_deferred set! 
*/ + if (WARN_ON_ONCE(!event->ctx->task)) + return -EINVAL; + guard(irqsave)(); =20 *defer_cookie =3D unwind_user_get_cookie(); @@ -13197,13 +13452,20 @@ perf_event_alloc(struct perf_event_attr *attr, in= t cpu, } } =20 + /* Setup unwind deferring for per CPU events */ + if (event->attr.defer_callchain && !task) { + err =3D perf_add_unwind_deferred(event); + if (err) + return ERR_PTR(err); + } + err =3D security_perf_event_alloc(event); if (err) return ERR_PTR(err); =20 if (event->attr.defer_callchain) init_task_work(&event->pending_unwind_work, - perf_event_callchain_deferred); + perf_event_deferred_task); =20 /* symmetric to unaccount_event() in _free_event() */ account_event(event); --=20 2.50.1