Message-ID: <20251007214123.705413492@kernel.org>
Date: Tue, 07 Oct 2025 17:40:10 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 bpf@vger.kernel.org, x86@kernel.org
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra,
 Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
 Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, "Jose E. Marchesi",
 Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
 Florian Weimer, Sam James, Kees Cook, "Carlos O'Donell"
Subject: [PATCH v16 2/4] perf: Support deferred user callchains
References: <20251007214008.080852573@kernel.org>

From: Josh Poimboeuf

If the user fault unwind is available (the one that will be used for
sframes), have perf utilize it.

Currently all user stack traces are done at the request site. This
mostly happens in interrupt or NMI context, where user space is only
accessible if it is currently present in memory.
It is possible that the user stack was swapped out and is not present,
but mostly the use of sframes will require faulting in user pages,
which will not be possible from interrupt context. Instead, add a
framework that will delay the reading of the user space stack until
the task goes back to user space, where faulting in pages is possible.
This is also advantageous as the user space stack doesn't change while
in the kernel, and it removes duplicate entries of user space stacks
for a long running system call being profiled.

A new perf context is created called PERF_CONTEXT_USER_DEFERRED. It is
added to the kernel callchain, usually when an interrupt or NMI is
triggered (but can be added to any callchain). When a deferred unwind
is required, it uses the new deferred unwind infrastructure.

When tracing a single task and a user stack trace is required, perf
will call unwind_deferred_request(). This will trigger a task_work
that, on task kernel space exit, will call the perf function
perf_event_deferred_task() with the user stacktrace and a cookie (an
identifier for that stack trace).

This user stack trace will go into a new perf record type called
PERF_RECORD_CALLCHAIN_DEFERRED. The perf user space will need to
attach this stack trace to each of the previous kernel callchains for
that task that have the PERF_CONTEXT_USER_DEFERRED context in them.

Suggested-by: Peter Zijlstra
Co-developed-by: Steven Rostedt (Google)
Signed-off-by: Josh Poimboeuf
Signed-off-by: Steven Rostedt (Google)
---
Changes since v15: https://lore.kernel.org/20250825180801.727927527@kernel.org

- Peter Zijlstra pointed out that the code mostly duplicated the code
  of the unwind infrastructure, and had the same bugs as it had. The
  unwind infrastructure was updated to allow a tracer to use it for a
  single task. The perf code now uses that, which greatly simplified
  this version over the previous one.
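As an aside for reviewers, below is a minimal user-space sketch of the
stitching described in the last paragraph of the changelog. It is
illustrative only and not part of this series: the struct mirrors the
PERF_RECORD_CALLCHAIN_DEFERRED layout documented in the uapi comment
in the patch, and stitch_callchain() is a hypothetical helper name.

/*
 * Illustrative only -- not part of this patch. A perf-tool-side
 * consumer buffers kernel-only callchains that end with the
 * PERF_CONTEXT_USER_DEFERRED marker and, when the task's
 * PERF_RECORD_CALLCHAIN_DEFERRED record arrives, splices the
 * deferred user frames in at the marker.
 */
#include <stdint.h>

#define PERF_CONTEXT_USER_DEFERRED	((uint64_t)-640)

/* Body of PERF_RECORD_CALLCHAIN_DEFERRED, after the event header. */
struct deferred_callchain {
	uint64_t cookie;	/* identifier for this stack trace */
	uint64_t nr;		/* number of entries in ips[] */
	uint64_t ips[];		/* ips[0] is PERF_CONTEXT_USER */
};

/*
 * Copy 'kchain' (a buffered kernel callchain of 'knr' entries) into
 * 'out', replacing the PERF_CONTEXT_USER_DEFERRED marker and anything
 * after it with the deferred user frames. Returns entries written.
 */
static uint64_t stitch_callchain(const uint64_t *kchain, uint64_t knr,
				 const struct deferred_callchain *d,
				 uint64_t *out, uint64_t out_max)
{
	uint64_t n = 0;

	for (uint64_t i = 0; i < knr && n < out_max; i++) {
		if (kchain[i] == PERF_CONTEXT_USER_DEFERRED)
			break;		/* user frames start here */
		out[n++] = kchain[i];
	}

	/* d->ips[] already leads with PERF_CONTEXT_USER. */
	for (uint64_t i = 0; i < d->nr && n < out_max; i++)
		out[n++] = d->ips[i];

	return n;
}

Matching is per task: the consumer applies the deferred record to every
buffered sample for that task whose callchain carries the marker, then
drops the buffered entries.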
 include/linux/perf_event.h            |   5 +-
 include/uapi/linux/perf_event.h       |  20 ++++-
 kernel/bpf/stackmap.c                 |   4 +-
 kernel/events/callchain.c             |  11 ++-
 kernel/events/core.c                  | 110 +++++++++++++++++++++++++-
 tools/include/uapi/linux/perf_event.h |  20 ++++-
 6 files changed, 162 insertions(+), 8 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fd1d91017b99..152e3dacff98 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -53,6 +53,7 @@
 #include
 #include
 #include
+#include <linux/unwind_deferred.h>
 
 #include
 
@@ -880,6 +881,8 @@ struct perf_event {
 	struct callback_head		pending_task;
 	unsigned int			pending_work;
 
+	struct unwind_work		unwind_work;
+
 	atomic_t			event_limit;
 
 	/* address range filters */
@@ -1720,7 +1723,7 @@ extern void perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct p
 extern void perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs);
 extern struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark);
+		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user);
 extern int get_callchain_buffers(int max_stack);
 extern void put_callchain_buffers(void);
 extern struct perf_callchain_entry *get_callchain_entry(int *rctx);
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space. Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index ec3a57a5fba1..339f7cbbcf36 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -315,7 +315,7 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
 		max_depth = sysctl_perf_event_max_stack;
 
 	trace = get_perf_callchain(regs, kernel, user, max_depth,
-				   false, false);
+				   false, false, false);
 
 	if (unlikely(!trace))
 		/* couldn't fetch the stack trace */
@@ -452,7 +452,7 @@ static long __bpf_get_stack(struct pt_regs *regs, struct task_struct *task,
 		trace = get_callchain_entry_for_task(task, max_depth);
 	else
 		trace = get_perf_callchain(regs, kernel, user, max_depth,
-					   crosstask, false);
+					   crosstask, false, false);
 
 	if (unlikely(!trace) || trace->nr < skip) {
 		if (may_fault)
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 808c0d7a31fa..d0e0da66a164 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -218,7 +218,7 @@ static void fixup_uretprobe_trampoline_entries(struct perf_callchain_entry *entr
 
 struct perf_callchain_entry *
 get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
-		   u32 max_stack, bool crosstask, bool add_mark)
+		   u32 max_stack, bool crosstask, bool add_mark, bool defer_user)
 {
 	struct perf_callchain_entry *entry;
 	struct perf_callchain_entry_ctx ctx;
@@ -251,6 +251,15 @@ get_perf_callchain(struct pt_regs *regs, bool kernel, bool user,
 		regs = task_pt_regs(current);
 	}
 
+	if (defer_user) {
+		/*
+		 * Foretell the coming of PERF_RECORD_CALLCHAIN_DEFERRED
+		 * which can be stitched to this one.
+		 */
+		perf_callchain_store_context(&ctx, PERF_CONTEXT_USER_DEFERRED);
+		goto exit_put;
+	}
+
 	if (add_mark)
 		perf_callchain_store_context(&ctx, PERF_CONTEXT_USER);
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28de3baff792..be94b437e7e0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5582,6 +5582,67 @@ static bool exclusive_event_installable(struct perf_event *event,
 	return true;
 }
 
+static void perf_pending_unwind_sync(struct perf_event *event)
+{
+	struct unwind_work *work = &event->unwind_work;
+
+	unwind_deferred_cancel(work);
+}
+
+struct perf_callchain_deferred_event {
+	struct perf_event_header	header;
+	u64				cookie;
+	u64				nr;
+	u64				ips[];
+};
+
+static void perf_event_callchain_deferred(struct perf_event *event,
+					  struct unwind_stacktrace *trace,
+					  u64 cookie)
+{
+	struct perf_callchain_deferred_event deferred_event;
+	u64 callchain_context = PERF_CONTEXT_USER;
+	struct perf_output_handle handle;
+	struct perf_sample_data data;
+	u64 nr;
+
+	nr = trace->nr + 1; /* '+1' == callchain_context */
+
+	deferred_event.header.type = PERF_RECORD_CALLCHAIN_DEFERRED;
+	deferred_event.header.misc = PERF_RECORD_MISC_USER;
+	deferred_event.header.size = sizeof(deferred_event) + (nr * sizeof(u64));
+
+	deferred_event.nr = nr;
+	deferred_event.cookie = cookie;
+
+	perf_event_header__init_id(&deferred_event.header, &data, event);
+
+	if (perf_output_begin(&handle, &data, event, deferred_event.header.size))
+		return;
+
+	perf_output_put(&handle, deferred_event);
+	perf_output_put(&handle, callchain_context);
+	/* trace->entries[] are not guaranteed to be 64bit */
+	for (int i = 0; i < trace->nr; i++) {
+		u64 entry = trace->entries[i];
+		perf_output_put(&handle, entry);
+	}
+	perf_event__output_id_sample(event, &handle, &data);
+
+	perf_output_end(&handle);
+}
+
+/* Deferred unwinding callback for task specific events */
+static void perf_event_deferred_task(struct unwind_work *work,
+				     struct unwind_stacktrace *trace, u64 cookie)
+{
+	struct perf_event *event = container_of(work, struct perf_event, unwind_work);
+
+	perf_event_callchain_deferred(event, trace, cookie);
+
+	local_dec(&event->ctx->nr_no_switch_fast);
+}
+
 static void perf_free_addr_filters(struct perf_event *event);
 
 /* vs perf_event_alloc() error */
@@ -5649,6 +5710,7 @@ static void _free_event(struct perf_event *event)
 {
 	irq_work_sync(&event->pending_irq);
 	irq_work_sync(&event->pending_disable_irq);
+	perf_pending_unwind_sync(event);
 
 	unaccount_event(event);
 
@@ -8194,6 +8256,28 @@ static u64 perf_get_page_size(unsigned long addr)
 
 static struct perf_callchain_entry __empty_callchain = { .nr = 0, };
 
+/*
+ * Returns:
+ * > 0 : if already queued.
+ *   0 : if it performed the queuing
+ * < 0 : if it did not get queued.
+ */
+static int deferred_request(struct perf_event *event)
+{
+	struct unwind_work *work = &event->unwind_work;
+	u64 cookie;
+
+	/* Only defer for task events */
+	if (!event->ctx->task)
+		return -EINVAL;
+
+	if ((current->flags & (PF_KTHREAD | PF_USER_WORKER)) ||
+	    !user_mode(task_pt_regs(current)))
+		return -EINVAL;
+
+	return unwind_deferred_request(work, &cookie);
+}
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
@@ -8204,6 +8288,8 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 	struct perf_callchain_entry *callchain;
+	bool defer_user = IS_ENABLED(CONFIG_UNWIND_USER) && user &&
+			  event->attr.defer_callchain;
 
 	if (!current->mm)
 		user = false;
@@ -8211,8 +8297,21 @@ perf_callchain(struct perf_event *event, struct pt_regs *regs)
 	if (!kernel && !user)
 		return &__empty_callchain;
 
-	callchain = get_perf_callchain(regs, kernel, user,
-				       max_stack, crosstask, true);
+	/* Disallow cross-task callchains. */
+	if (event->ctx->task && event->ctx->task != current)
+		return &__empty_callchain;
+
+	if (defer_user) {
+		int ret = deferred_request(event);
+		if (!ret)
+			local_inc(&event->ctx->nr_no_switch_fast);
+		else if (ret < 0)
+			defer_user = false;
+	}
+
+	callchain = get_perf_callchain(regs, kernel, user, max_stack,
+				       crosstask, true, defer_user);
+
 	return callchain ?: &__empty_callchain;
 }
 
@@ -13046,6 +13145,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 		}
 	}
 
+	if (event->attr.defer_callchain) {
+		if (task) {
+			err = unwind_deferred_task_init(&event->unwind_work,
+							perf_event_deferred_task);
+		}
+	}
+
 	err = security_perf_event_alloc(event);
 	if (err)
 		return ERR_PTR(err);
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 78a362b80027..20b8f890113b 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -463,7 +463,8 @@ struct perf_event_attr {
 				inherit_thread :  1, /* children only inherit if cloned with CLONE_THREAD */
 				remove_on_exec :  1, /* event is removed from task on exec */
 				sigtrap        :  1, /* send synchronous SIGTRAP on event */
-				__reserved_1   : 26;
+				defer_callchain:  1, /* generate PERF_RECORD_CALLCHAIN_DEFERRED records */
+				__reserved_1   : 25;
 
 	union {
 		__u32		wakeup_events;	  /* wake up every n events */
@@ -1239,6 +1240,22 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_AUX_OUTPUT_HW_ID		= 21,
 
+	/*
+	 * This user callchain capture was deferred until shortly before
+	 * returning to user space. Previous samples would have kernel
+	 * callchains only and they need to be stitched with this to make full
+	 * callchains.
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u64				cookie;
+	 *	u64				nr;
+	 *	u64				ips[nr];
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_CALLCHAIN_DEFERRED		= 22,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
@@ -1269,6 +1286,7 @@ enum perf_callchain_context {
 	PERF_CONTEXT_HV				= (__u64)-32,
 	PERF_CONTEXT_KERNEL			= (__u64)-128,
 	PERF_CONTEXT_USER			= (__u64)-512,
+	PERF_CONTEXT_USER_DEFERRED		= (__u64)-640,
 
 	PERF_CONTEXT_GUEST			= (__u64)-2048,
 	PERF_CONTEXT_GUEST_KERNEL		= (__u64)-2176,
-- 
2.50.1
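
To go with the above, a minimal usage sketch (illustrative only, not
from this series) of how a profiler might request deferred callchains
once the defer_callchain bit is available. open_deferred_sampler() is
a hypothetical helper; sample_id_all is set so that the
PERF_RECORD_CALLCHAIN_DEFERRED records carry the TID that user space
needs to match them to earlier samples.

/*
 * Illustrative only -- assumes uapi headers with this series applied.
 * Opens a task-bound sampling event that requests deferred user
 * callchains via the new defer_callchain bit.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static int open_deferred_sampler(pid_t pid)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_CALLCHAIN;
	attr.sample_id_all = 1;		/* deferred records carry a TID */
	attr.defer_callchain = 1;	/* new bit added by this patch */

	/* Deferral only applies to task events, so bind to a pid. */
	return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}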