From nobody Mon Oct 6 01:26:43 2025
Message-ID: <20250726141223.943438001@kernel.org>
Date: Sat, 26 Jul 2025 10:07:05 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 Peter Zijlstra, Linus Torvalds, Ingo Molnar, Josh Poimboeuf, Jiri Olsa,
 Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
 Indu Bhagat, "Jose E. Marchesi", Beau Belgrave, Jens Remus, Jens Axboe,
 Florian Weimer, Sam James
Subject: [for-next][PATCH 01/10] unwind_user: Add user space unwinding API with frame pointer support
References: <20250726140704.560579628@kernel.org>

From: Josh Poimboeuf

Introduce a generic API for unwinding user stacks.

In order to expand user space unwinding to handle more complex scenarios,
such as deferred unwinding and reading user space information, create a
generic interface that all architectures supporting the various unwinding
methods can use.

This is an alternative to the simple stack_trace_save_user() API for
handling user space stack traces. It does not replace that interface, but
it will be used to expand the functionality of user space stack walking.

None of the structures introduced will be exposed to user space tooling.

Support for frame pointer unwinding is added. For an architecture to
support frame pointer unwinding it needs to enable
CONFIG_HAVE_UNWIND_USER_FP and define ARCH_INIT_USER_FP_FRAME.
By encoding the frame offsets in struct unwind_user_frame, much of this
code can also be reused for future unwinder implementations like sframe.

Cc: Masami Hiramatsu
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E. Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185739.233988371@kernel.org
Signed-off-by: Josh Poimboeuf
Co-developed-by: Mathieu Desnoyers
Link: https://lore.kernel.org/all/20250710164301.3094-2-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Co-developed-by: Steven Rostedt (Google)
Signed-off-by: Steven Rostedt (Google)
---
 MAINTAINERS                       |   8 ++
 arch/Kconfig                      |   7 ++
 include/asm-generic/Kbuild        |   1 +
 include/asm-generic/unwind_user.h |   5 ++
 include/linux/unwind_user.h       |  14 ++++
 include/linux/unwind_user_types.h |  44 ++++++++++
 kernel/Makefile                   |   1 +
 kernel/unwind/Makefile            |   1 +
 kernel/unwind/user.c              | 128 ++++++++++++++++++++++++++++++
 9 files changed, 209 insertions(+)
 create mode 100644 include/asm-generic/unwind_user.h
 create mode 100644 include/linux/unwind_user.h
 create mode 100644 include/linux/unwind_user_types.h
 create mode 100644 kernel/unwind/Makefile
 create mode 100644 kernel/unwind/user.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fad6cb025a19..370d780fd5f8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25928,6 +25928,14 @@ F: Documentation/driver-api/uio-howto.rst
 F: drivers/uio/
 F: include/linux/uio_driver.h
 
+USERSPACE STACK UNWINDING
+M: Josh Poimboeuf
+M: Steven Rostedt
+S: Maintained
+F: include/linux/unwind*.h
+F: kernel/unwind/
+
+
 UTIL-LINUX PACKAGE
 M: Karel Zak
 L: util-linux@vger.kernel.org

diff --git a/arch/Kconfig b/arch/Kconfig
index a3308a220f86..8e3fd723bd74 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -435,6 +435,13 @@ config HAVE_HARDLOCKUP_DETECTOR_ARCH
	  It uses the same
	  command line parameters, and sysctl interface, as the
	  generic hardlockup detectors.
 
+config UNWIND_USER
+	bool
+
+config HAVE_UNWIND_USER_FP
+	bool
+	select UNWIND_USER
+
 config HAVE_PERF_REGS
	bool
	help

diff --git a/include/asm-generic/Kbuild b/include/asm-generic/Kbuild
index 8675b7b4ad23..295c94a3ccc1 100644
--- a/include/asm-generic/Kbuild
+++ b/include/asm-generic/Kbuild
@@ -59,6 +59,7 @@ mandatory-y += tlbflush.h
 mandatory-y += topology.h
 mandatory-y += trace_clock.h
 mandatory-y += uaccess.h
+mandatory-y += unwind_user.h
 mandatory-y += vermagic.h
 mandatory-y += vga.h
 mandatory-y += video.h

diff --git a/include/asm-generic/unwind_user.h b/include/asm-generic/unwind_user.h
new file mode 100644
index 000000000000..b8882b909944
--- /dev/null
+++ b/include/asm-generic/unwind_user.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_GENERIC_UNWIND_USER_H
+#define _ASM_GENERIC_UNWIND_USER_H
+
+#endif /* _ASM_GENERIC_UNWIND_USER_H */

diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
new file mode 100644
index 000000000000..7f7282516bf5
--- /dev/null
+++ b/include/linux/unwind_user.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_H
+#define _LINUX_UNWIND_USER_H
+
+#include
+#include
+
+#ifndef ARCH_INIT_USER_FP_FRAME
+ #define ARCH_INIT_USER_FP_FRAME
+#endif
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
+
+#endif /* _LINUX_UNWIND_USER_H */

diff --git a/include/linux/unwind_user_types.h b/include/linux/unwind_user_types.h
new file mode 100644
index 000000000000..a449f15be890
--- /dev/null
+++ b/include/linux/unwind_user_types.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_TYPES_H
+#define _LINUX_UNWIND_USER_TYPES_H
+
+#include
+
+/*
+ * Unwind types, listed in priority order: lower numbers are attempted first if
+ * available.
+ */
+enum unwind_user_type_bits {
+	UNWIND_USER_TYPE_FP_BIT = 0,
+
+	NR_UNWIND_USER_TYPE_BITS,
+};
+
+enum unwind_user_type {
+	/* Type "none" for the start of stack walk iteration. */
+	UNWIND_USER_TYPE_NONE = 0,
+	UNWIND_USER_TYPE_FP = BIT(UNWIND_USER_TYPE_FP_BIT),
+};
+
+struct unwind_stacktrace {
+	unsigned int	nr;
+	unsigned long	*entries;
+};
+
+struct unwind_user_frame {
+	s32 cfa_off;
+	s32 ra_off;
+	s32 fp_off;
+	bool use_fp;
+};
+
+struct unwind_user_state {
+	unsigned long	ip;
+	unsigned long	sp;
+	unsigned long	fp;
+	enum unwind_user_type	current_type;
+	unsigned int	available_types;
+	bool	done;
+};
+
+#endif /* _LINUX_UNWIND_USER_TYPES_H */

diff --git a/kernel/Makefile b/kernel/Makefile
index 32e80dd626af..541186050251 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -55,6 +55,7 @@ obj-y += rcu/
 obj-y += livepatch/
 obj-y += dma/
 obj-y += entry/
+obj-y += unwind/
 obj-$(CONFIG_MODULES) += module/
 
 obj-$(CONFIG_KCMP) += kcmp.o

diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
new file mode 100644
index 000000000000..349ce3677526
--- /dev/null
+++ b/kernel/unwind/Makefile
@@ -0,0 +1 @@
+ obj-$(CONFIG_UNWIND_USER) += user.o

diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
new file mode 100644
index 000000000000..85b8c764d2f7
--- /dev/null
+++ b/kernel/unwind/user.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+* Generic interfaces for unwinding user space
+*/
+#include
+#include
+#include
+#include
+#include
+
+static struct unwind_user_frame fp_frame = {
+	ARCH_INIT_USER_FP_FRAME
+};
+
+#define for_each_user_frame(state) \
+	for (unwind_user_start(state); !(state)->done; unwind_user_next(state))
+
+static int unwind_user_next_fp(struct unwind_user_state *state)
+{
+	struct unwind_user_frame *frame = &fp_frame;
+	unsigned long cfa, fp, ra = 0;
+	unsigned int shift;
+
+	if (frame->use_fp) {
+		if (state->fp < state->sp)
+			return -EINVAL;
+		cfa = state->fp;
+	} else {
+		cfa = state->sp;
+	}
+
+	/* Get the Canonical Frame Address (CFA) */
+	cfa += frame->cfa_off;
+
+	/* stack going in wrong direction? */
+	if (cfa <= state->sp)
+		return -EINVAL;
+
+	/* Make sure that the address is word aligned */
+	shift = sizeof(long) == 4 ? 2 : 3;
+	if (cfa & ((1 << shift) - 1))
+		return -EINVAL;
+
+	/* Find the Return Address (RA) */
+	if (get_user(ra, (unsigned long __user *)(cfa + frame->ra_off)))
+		return -EINVAL;
+
+	if (frame->fp_off && get_user(fp, (unsigned long __user *)(cfa + frame->fp_off)))
+		return -EINVAL;
+
+	state->ip = ra;
+	state->sp = cfa;
+	if (frame->fp_off)
+		state->fp = fp;
+	return 0;
+}
+
+static int unwind_user_next(struct unwind_user_state *state)
+{
+	unsigned long iter_mask = state->available_types;
+	unsigned int bit;
+
+	if (state->done)
+		return -EINVAL;
+
+	for_each_set_bit(bit, &iter_mask, NR_UNWIND_USER_TYPE_BITS) {
+		enum unwind_user_type type = BIT(bit);
+
+		state->current_type = type;
+		switch (type) {
+		case UNWIND_USER_TYPE_FP:
+			if (!unwind_user_next_fp(state))
+				return 0;
+			continue;
+		default:
+			WARN_ONCE(1, "Undefined unwind bit %d", bit);
+			break;
+		}
+		break;
+	}
+
+	/* No successful unwind method.
+	 */
+	state->current_type = UNWIND_USER_TYPE_NONE;
+	state->done = true;
+	return -EINVAL;
+}
+
+static int unwind_user_start(struct unwind_user_state *state)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+
+	memset(state, 0, sizeof(*state));
+
+	if ((current->flags & PF_KTHREAD) || !user_mode(regs)) {
+		state->done = true;
+		return -EINVAL;
+	}
+
+	if (IS_ENABLED(CONFIG_HAVE_UNWIND_USER_FP))
+		state->available_types |= UNWIND_USER_TYPE_FP;
+
+	state->ip = instruction_pointer(regs);
+	state->sp = user_stack_pointer(regs);
+	state->fp = frame_pointer(regs);
+
+	return 0;
+}
+
+int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)
+{
+	struct unwind_user_state state;
+
+	trace->nr = 0;
+
+	if (!max_entries)
+		return -EINVAL;
+
+	if (current->flags & PF_KTHREAD)
+		return 0;
+
+	for_each_user_frame(&state) {
+		trace->entries[trace->nr++] = state.ip;
+		if (trace->nr >= max_entries)
+			break;
+	}
+
+	return 0;
+}
-- 
2.47.2

From nobody Mon Oct 6 01:26:43 2025
Message-ID: <20250726141224.113518204@kernel.org>
Date: Sat, 26 Jul 2025 10:07:06 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 Peter Zijlstra, Linus Torvalds, Ingo Molnar, Josh Poimboeuf, Jiri Olsa,
 Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
 Indu Bhagat, "Jose E.
Marchesi", Beau Belgrave, Jens Remus, Jens Axboe, Florian Weimer, Sam James
Subject: [for-next][PATCH 02/10] unwind_user/deferred: Add unwind_user_faultable()
References: <20250726140704.560579628@kernel.org>

From: Steven Rostedt

Add a new API to retrieve a user space callstack, called
unwind_user_faultable(). The difference between this and the current user
space stack tracer is that it must be called from a faultable context, as
it may use routines that access user space data that needs to be faulted
in. It can safely be called when entering or exiting a system call, since
user pages can still be faulted in there.

This code is based on Josh Poimboeuf's deferred unwinding work:

Link: https://lore.kernel.org/all/6052e8487746603bdb29b65f4033e739092d9925.1737511963.git.jpoimboe@kernel.org/

Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Josh Poimboeuf
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E.
Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185739.399622407@kernel.org
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/sched.h                 |  5 +++
 include/linux/unwind_deferred.h       | 24 +++++++++++
 include/linux/unwind_deferred_types.h |  9 ++++
 kernel/fork.c                         |  4 ++
 kernel/unwind/Makefile                |  2 +-
 kernel/unwind/deferred.c              | 60 +++++++++++++++++++++++++++
 6 files changed, 103 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/unwind_deferred.h
 create mode 100644 include/linux/unwind_deferred_types.h
 create mode 100644 kernel/unwind/deferred.c

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f78a64beb52..59fdf7d9bb1e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -46,6 +46,7 @@
 #include
 #include
 #include
+#include
 #include
 
 /* task_struct member predeclarations (sorted alphabetically): */
@@ -1654,6 +1655,10 @@ struct task_struct {
	struct user_event_mm *user_event_mm;
 #endif
 
+#ifdef CONFIG_UNWIND_USER
+	struct unwind_task_info unwind_info;
+#endif
+
	/* CPU-specific state of this task: */
	struct thread_struct thread;
 
diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
new file mode 100644
index 000000000000..a5f6e8f8a1a2
--- /dev/null
+++ b/include/linux/unwind_deferred.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_H
+#define _LINUX_UNWIND_USER_DEFERRED_H
+
+#include
+#include
+
+#ifdef CONFIG_UNWIND_USER
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_user_faultable(struct unwind_stacktrace *trace);
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return
-ENOSYS; }
+
+#endif /* !CONFIG_UNWIND_USER */
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_H */

diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
new file mode 100644
index 000000000000..aa32db574e43
--- /dev/null
+++ b/include/linux/unwind_deferred_types.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+#define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
+
+struct unwind_task_info {
+	unsigned long	*entries;
+};
+
+#endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */

diff --git a/kernel/fork.c b/kernel/fork.c
index 1ee8eb11f38b..3341d50c61f2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -732,6 +733,7 @@ void __put_task_struct(struct task_struct *tsk)
	WARN_ON(refcount_read(&tsk->usage));
	WARN_ON(tsk == current);
 
+	unwind_task_free(tsk);
	sched_ext_free(tsk);
	io_uring_free(tsk);
	cgroup_free(tsk);
@@ -2135,6 +2137,8 @@ __latent_entropy struct task_struct *copy_process(
	p->bpf_ctx = NULL;
 #endif
 
+	unwind_task_init(p);
+
	/* Perform scheduler related setup. Assign this task to a CPU.
	 */
	retval = sched_fork(clone_flags, p);
	if (retval)

diff --git a/kernel/unwind/Makefile b/kernel/unwind/Makefile
index 349ce3677526..eae37bea54fd 100644
--- a/kernel/unwind/Makefile
+++ b/kernel/unwind/Makefile
@@ -1 +1 @@
- obj-$(CONFIG_UNWIND_USER) += user.o
+ obj-$(CONFIG_UNWIND_USER) += user.o deferred.o

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
new file mode 100644
index 000000000000..a0badbeb3cc1
--- /dev/null
+++ b/kernel/unwind/deferred.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Deferred user space unwinding
+ */
+#include
+#include
+#include
+#include
+
+#define UNWIND_MAX_ENTRIES 512
+
+/**
+ * unwind_user_faultable - Produce a user stacktrace in faultable context
+ * @trace: The descriptor that will store the user stacktrace
+ *
+ * This must be called in a known faultable context (usually when entering
+ * or exiting user space). Depending on the available implementations
+ * the @trace will be loaded with the addresses of the user space stacktrace
+ * if it can be found.
+ *
+ * Return: 0 on success and negative on error
+ *         On success @trace will contain the user space stacktrace
+ */
+int unwind_user_faultable(struct unwind_stacktrace *trace)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+
+	/* Should always be called from faultable context */
+	might_fault();
+
+	if (current->flags & PF_EXITING)
+		return -EINVAL;
+
+	if (!info->entries) {
+		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
+					      GFP_KERNEL);
+		if (!info->entries)
+			return -ENOMEM;
+	}
+
+	trace->nr = 0;
+	trace->entries = info->entries;
+	unwind_user(trace, UNWIND_MAX_ENTRIES);
+
+	return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	memset(info, 0, sizeof(*info));
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	kfree(info->entries);
+}
-- 
2.47.2

From nobody Mon Oct 6 01:26:43 2025
Message-ID: <20250726141224.284101839@kernel.org>
Date: Sat, 26 Jul 2025 10:07:07 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 Peter Zijlstra, Linus Torvalds, Ingo Molnar, Josh Poimboeuf, Jiri Olsa,
 Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
 Indu Bhagat, "Jose E.
Marchesi", Beau Belgrave, Jens Remus, Jens Axboe, Florian Weimer, Sam James
Subject: [for-next][PATCH 03/10] unwind_user/deferred: Add unwind cache
References: <20250726140704.560579628@kernel.org>

From: Josh Poimboeuf

Cache the results of the unwind so that the unwind is performed only once,
even when called by multiple tracers.

The cache's nr_entries gets cleared every time the task exits the kernel.
When a stacktrace is requested, nr_entries gets set to the number of
entries in the stacktrace. If another stacktrace is requested while
nr_entries is not zero, the cache already holds the stacktrace that would
be retrieved, so it is not generated again and the cached entries are
given to the caller.

Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E.
Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185739.573388765@kernel.org
Co-developed-by: Steven Rostedt (Google)
Signed-off-by: Josh Poimboeuf
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/entry-common.h          |  2 ++
 include/linux/unwind_deferred.h       |  8 +++++++
 include/linux/unwind_deferred_types.h |  7 +++++-
 kernel/unwind/deferred.c              | 31 +++++++++++++++++++++------
 4 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index f94f3fdf15fc..8908b8eeb99b 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -362,6 +363,7 @@ static __always_inline void exit_to_user_mode(void)
	lockdep_hardirqs_on_prepare();
	instrumentation_end();
 
+	unwind_reset_info();
	user_enter_irqoff();
	arch_exit_to_user_mode();
	lockdep_hardirqs_on(CALLER_ADDR0);

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index a5f6e8f8a1a2..baacf4a1eb4c 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -12,6 +12,12 @@ void unwind_task_free(struct task_struct *task);
 
 int unwind_user_faultable(struct unwind_stacktrace *trace);
 
+static __always_inline void unwind_reset_info(void)
+{
+	if (unlikely(current->unwind_info.cache))
+		current->unwind_info.cache->nr_entries = 0;
+}
+
 #else /* !CONFIG_UNWIND_USER */
 
 static inline void unwind_task_init(struct task_struct *task) {}
@@ -19,6 +25,8 @@ static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
 
+static inline void unwind_reset_info(void) {}
+
 #endif /* !CONFIG_UNWIND_USER */
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_H */

diff --git a/include/linux/unwind_deferred_types.h
 b/include/linux/unwind_deferred_types.h
index aa32db574e43..db5b54b18828 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -2,8 +2,13 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H
 
+struct unwind_cache {
+	unsigned int	nr_entries;
+	unsigned long	entries[];
+};
+
 struct unwind_task_info {
-	unsigned long	*entries;
+	struct unwind_cache	*cache;
 };
 
 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index a0badbeb3cc1..96368a5aa522 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -4,10 +4,13 @@
  */
 #include
 #include
+#include
 #include
 #include
 
-#define UNWIND_MAX_ENTRIES 512
+/* Make the cache fit in a 4K page */
+#define UNWIND_MAX_ENTRIES					\
+	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))
 
 /**
  * unwind_user_faultable - Produce a user stacktrace in faultable context
@@ -24,6 +27,7 @@
 int unwind_user_faultable(struct unwind_stacktrace *trace)
 {
	struct unwind_task_info *info = &current->unwind_info;
+	struct unwind_cache *cache;
 
	/* Should always be called from faultable context */
	might_fault();
@@ -31,17 +35,30 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
	if (current->flags & PF_EXITING)
		return -EINVAL;
 
-	if (!info->entries) {
-		info->entries = kmalloc_array(UNWIND_MAX_ENTRIES, sizeof(long),
-					      GFP_KERNEL);
-		if (!info->entries)
+	if (!info->cache) {
+		info->cache = kzalloc(struct_size(cache, entries, UNWIND_MAX_ENTRIES),
+				      GFP_KERNEL);
+		if (!info->cache)
			return -ENOMEM;
	}
 
+	cache = info->cache;
+	trace->entries = cache->entries;
+
+	if (cache->nr_entries) {
+		/*
+		 * The user stack has already been previously unwound in this
+		 * entry context. Skip the unwind and use the cache.
+		 */
+		trace->nr = cache->nr_entries;
+		return 0;
+	}
+
	trace->nr = 0;
-	trace->entries = info->entries;
	unwind_user(trace, UNWIND_MAX_ENTRIES);
 
+	cache->nr_entries = trace->nr;
+
	return 0;
 }
 
@@ -56,5 +73,5 @@ void unwind_task_free(struct task_struct *task)
 {
	struct unwind_task_info *info = &task->unwind_info;
 
-	kfree(info->entries);
+	kfree(info->cache);
 }
-- 
2.47.2

From nobody Mon Oct 6 01:26:43 2025
Message-ID: <20250726141224.454720903@kernel.org>
Date: Sat, 26 Jul 2025 10:07:08 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
 Peter Zijlstra, Linus Torvalds, Ingo Molnar, Josh Poimboeuf, Jiri Olsa,
 Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko,
 Indu Bhagat, "Jose E. Marchesi", Beau Belgrave, Jens Remus, Jens Axboe,
 Florian Weimer, Sam James
Subject: [for-next][PATCH 04/10] unwind_user/deferred: Add deferred unwinding interface
References: <20250726140704.560579628@kernel.org>

From: Josh Poimboeuf

Add an interface for scheduling task work to unwind the user space stack
before returning to user space. This solves several problems for its
callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in interrupt context.

  - Avoid duplicate unwinds, whether called multiple times by the same
    caller or by different callers.

  - Create a "context cookie" which allows trace post-processing to
    correlate kernel unwinds/traces with the user unwind.
A concept of a "cookie" is created to detect when the stacktrace is the same. A cookie is generated the first time a user space stacktrace is requested after the task enters the kernel. As the stacktrace is saved on the task_struct while the task is in the kernel, if another request comes in, if the cookie is still the same, it will use the saved stacktrace, and not have to regenerate one. The cookie is passed to the caller on request, and when the stacktrace is generated upon returning to user space, it calls the requester's callback with the cookie as well as the stacktrace. The cookie is cleared when it goes back to user space. Note, this currently adds another conditional to the unwind_reset_info() path that is always called returning to user space, but future changes will put this back to a single conditional. A global list is created and protected by a global mutex that holds tracers that register with the unwind infrastructure. The number of registered tracers will be limited in future changes. Each perf program or ftrace instance will register its own descriptor to use for deferred unwind stack traces. Note, in the function unwind_deferred_task_work() that gets called when returning to user space, it uses a global mutex for synchronization which will cause a big bottleneck. This will be replaced by SRCU, but that change adds some complex synchronization that deservers its own commit. Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Jiri Olsa Cc: Arnaldo Carvalho de Melo Cc: Namhyung Kim Cc: Thomas Gleixner Cc: Andrii Nakryiko Cc: Indu Bhagat Cc: "Jose E. 
Marchesi" Cc: Beau Belgrave Cc: Jens Remus Cc: Linus Torvalds Cc: Andrew Morton Cc: Jens Axboe Cc: Florian Weimer Cc: Sam James Link: https://lore.kernel.org/20250725185739.735072631@kernel.org Co-developed-by: Steven Rostedt (Google) Signed-off-by: Josh Poimboeuf Signed-off-by: Steven Rostedt (Google) --- include/linux/unwind_deferred.h | 24 ++++ include/linux/unwind_deferred_types.h | 24 ++++ kernel/unwind/deferred.c | 156 +++++++++++++++++++++++++- 3 files changed, 203 insertions(+), 1 deletion(-) diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferre= d.h index baacf4a1eb4c..14efd8c027aa 100644 --- a/include/linux/unwind_deferred.h +++ b/include/linux/unwind_deferred.h @@ -2,9 +2,19 @@ #ifndef _LINUX_UNWIND_USER_DEFERRED_H #define _LINUX_UNWIND_USER_DEFERRED_H =20 +#include #include #include =20 +struct unwind_work; + +typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_= stacktrace *trace, u64 cookie); + +struct unwind_work { + struct list_head list; + unwind_callback_t func; +}; + #ifdef CONFIG_UNWIND_USER =20 void unwind_task_init(struct task_struct *task); @@ -12,8 +22,19 @@ void unwind_task_free(struct task_struct *task); =20 int unwind_user_faultable(struct unwind_stacktrace *trace); =20 +int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func); +int unwind_deferred_request(struct unwind_work *work, u64 *cookie); +void unwind_deferred_cancel(struct unwind_work *work); + static __always_inline void unwind_reset_info(void) { + if (unlikely(current->unwind_info.id.id)) + current->unwind_info.id.id =3D 0; + /* + * As unwind_user_faultable() can be called directly and + * depends on nr_entries being cleared on exit to user, + * this needs to be a separate conditional. 
+	 */
 	if (unlikely(current->unwind_info.cache))
 		current->unwind_info.cache->nr_entries = 0;
 }
 
@@ -24,6 +45,9 @@ static inline void unwind_task_init(struct task_struct *task) {}
 static inline void unwind_task_free(struct task_struct *task) {}
 
 static inline int unwind_user_faultable(struct unwind_stacktrace *trace) { return -ENOSYS; }
+static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { return -ENOSYS; }
+static inline void unwind_deferred_cancel(struct unwind_work *work) {}
 
 static inline void unwind_reset_info(void) {}
 
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index db5b54b18828..104c477d5609 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -7,8 +7,32 @@ struct unwind_cache {
 	unsigned long entries[];
 };
 
+/*
+ * The unwind_task_id is a unique identifier that maps to a user space
+ * stacktrace. It is generated the first time a deferred user space
+ * stacktrace is requested after a task has entered the kernel and
+ * is cleared to zero when it exits. The mapped id will be a non-zero
+ * number.
+ *
+ * To simplify the generation of the 64 bit number, 32 bits will be
+ * the CPU it was generated on, and the other 32 bits will be a per
+ * cpu counter that gets incremented by two every time a new identifier
+ * is generated. The LSB will always be set to keep the value
+ * from being zero.
+ */ +union unwind_task_id { + struct { + u32 cpu; + u32 cnt; + }; + u64 id; +}; + struct unwind_task_info { struct unwind_cache *cache; + struct callback_head work; + union unwind_task_id id; + int pending; }; =20 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */ diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c index 96368a5aa522..2cbae2ada309 100644 --- a/kernel/unwind/deferred.c +++ b/kernel/unwind/deferred.c @@ -2,16 +2,63 @@ /* * Deferred user space unwinding */ +#include +#include +#include +#include #include #include #include #include -#include +#include =20 /* Make the cache fit in a 4K page */ #define UNWIND_MAX_ENTRIES \ ((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long)) =20 +/* Guards adding to and reading the list of callbacks */ +static DEFINE_MUTEX(callback_mutex); +static LIST_HEAD(callbacks); + +/* + * This is a unique percpu identifier for a given task entry context. + * Conceptually, it's incremented every time the CPU enters the kernel from + * user space, so that each "entry context" on the CPU gets a unique ID. = In + * reality, as an optimization, it's only incremented on demand for the fi= rst + * deferred unwind request after a given entry-from-user. + * + * It's combined with the CPU id to make a systemwide-unique "context cook= ie". + */ +static DEFINE_PER_CPU(u32, unwind_ctx_ctr); + +/* + * The context cookie is a unique identifier that is assigned to a user + * space stacktrace. As the user space stacktrace remains the same while + * the task is in the kernel, the cookie is an identifier for the stacktra= ce. + * Although it is possible for the stacktrace to get another cookie if ano= ther + * request is made after the cookie was cleared and before reentering user + * space. 
+ */ +static u64 get_cookie(struct unwind_task_info *info) +{ + u32 cnt =3D 1; + u32 old =3D 0; + + if (info->id.cpu) + return info->id.id; + + /* LSB is always set to ensure 0 is an invalid value */ + cnt |=3D __this_cpu_read(unwind_ctx_ctr) + 2; + if (try_cmpxchg(&info->id.cnt, &old, cnt)) { + /* Update the per cpu counter */ + __this_cpu_write(unwind_ctx_ctr, cnt); + } + /* Interrupts are disabled, the CPU will always be same */ + info->id.cpu =3D smp_processor_id() + 1; /* Must be non zero */ + + return info->id.id; +} + /** * unwind_user_faultable - Produce a user stacktrace in faultable context * @trace: The descriptor that will store the user stacktrace @@ -62,11 +109,117 @@ int unwind_user_faultable(struct unwind_stacktrace *t= race) return 0; } =20 +static void unwind_deferred_task_work(struct callback_head *head) +{ + struct unwind_task_info *info =3D container_of(head, struct unwind_task_i= nfo, work); + struct unwind_stacktrace trace; + struct unwind_work *work; + u64 cookie; + + if (WARN_ON_ONCE(!info->pending)) + return; + + /* Allow work to come in again */ + WRITE_ONCE(info->pending, 0); + + /* + * From here on out, the callback must always be called, even if it's + * just an empty trace. + */ + trace.nr =3D 0; + trace.entries =3D NULL; + + unwind_user_faultable(&trace); + + cookie =3D info->id.id; + + guard(mutex)(&callback_mutex); + list_for_each_entry(work, &callbacks, list) { + work->func(work, &trace, cookie); + } +} + +/** + * unwind_deferred_request - Request a user stacktrace on task kernel exit + * @work: Unwind descriptor requesting the trace + * @cookie: The cookie of the first request made for this task + * + * Schedule a user space unwind to be done in task work before exiting the + * kernel. + * + * The returned @cookie output is the generated cookie of the very first + * request for a user space stacktrace for this task since it entered the + * kernel. It can be from a request by any caller of this infrastructure. 
+ * Its value will also be passed to the callback function. It can be + * used to stitch kernel and user stack traces together in post-processing. + * + * It's valid to call this function multiple times for the same @work with= in + * the same task entry context. Each call will return the same cookie + * while the task hasn't left the kernel. If the callback is not pending + * because it has already been previously called for the same entry contex= t, + * it will be called again with the same stack trace and cookie. + * + * Return: 1 if the the callback was already queued. + * 0 if the callback successfully was queued. + * Negative if there's an error. + * @cookie holds the cookie of the first request by any user + */ +int unwind_deferred_request(struct unwind_work *work, u64 *cookie) +{ + struct unwind_task_info *info =3D ¤t->unwind_info; + int ret; + + *cookie =3D 0; + + if (WARN_ON_ONCE(in_nmi())) + return -EINVAL; + + if ((current->flags & (PF_KTHREAD | PF_EXITING)) || + !user_mode(task_pt_regs(current))) + return -EINVAL; + + guard(irqsave)(); + + *cookie =3D get_cookie(info); + + /* callback already pending? */ + if (info->pending) + return 1; + + /* The work has been claimed, now schedule it. 
 */
+	ret = task_work_add(current, &info->work, TWA_RESUME);
+	if (WARN_ON_ONCE(ret))
+		return ret;
+
+	info->pending = 1;
+	return 0;
+}
+
+void unwind_deferred_cancel(struct unwind_work *work)
+{
+	if (!work)
+		return;
+
+	guard(mutex)(&callback_mutex);
+	list_del(&work->list);
+}
+
+int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+
+	guard(mutex)(&callback_mutex);
+	list_add(&work->list, &callbacks);
+	work->func = func;
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
+	init_task_work(&info->work, unwind_deferred_task_work);
 }
 
 void unwind_task_free(struct task_struct *task)
@@ -74,4 +227,5 @@ void unwind_task_free(struct task_struct *task)
 	struct unwind_task_info *info = &task->unwind_info;
 
 	kfree(info->cache);
+	task_work_cancel(task, &info->work);
 }
-- 
2.47.2

From nobody Mon Oct  6 01:26:43 2025
Message-ID: <20250726141224.627257236@kernel.org>
User-Agent: quilt/0.68
Date: Sat, 26 Jul 2025 10:07:09 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton, Peter Zijlstra, Linus Torvalds, Ingo Molnar, Josh Poimboeuf, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim, Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, "Jose E.
Marchesi" , Beau Belgrave , Jens Remus , Jens Axboe , Florian Weimer , Sam James Subject: [for-next][PATCH 05/10] unwind_user/deferred: Make unwind deferral requests NMI-safe References: <20250726140704.560579628@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Steven Rostedt Make unwind_deferred_request() NMI-safe so tracers in NMI context can call it and safely request a user space stacktrace when the task exits. Note, this is only allowed for architectures that implement a safe cmpxchg. If an architecture requests a deferred stack trace from NMI context that does not support a safe NMI cmpxchg, it will get an -EINVAL and trigger a warning. For those architectures, they would need another method (perhaps an irqwork), to request a deferred user space stack trace. That can be dealt with later if one of theses architectures require this feature. Cc: Masami Hiramatsu Cc: Mathieu Desnoyers Cc: Josh Poimboeuf Cc: Ingo Molnar Cc: Jiri Olsa Cc: Arnaldo Carvalho de Melo Cc: Namhyung Kim Cc: Thomas Gleixner Cc: Andrii Nakryiko Cc: Indu Bhagat Cc: "Jose E. Marchesi" Cc: Beau Belgrave Cc: Jens Remus Cc: Linus Torvalds Cc: Andrew Morton Cc: Jens Axboe Cc: Florian Weimer Cc: Sam James Link: https://lore.kernel.org/20250725185739.906767342@kernel.org Suggested-by: Peter Zijlstra Signed-off-by: Steven Rostedt (Google) --- kernel/unwind/deferred.c | 52 +++++++++++++++++++++++++++++++++------- 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c index 2cbae2ada309..c5ac087d2396 100644 --- a/kernel/unwind/deferred.c +++ b/kernel/unwind/deferred.c @@ -12,6 +12,31 @@ #include #include =20 +/* + * For requesting a deferred user space stack trace from NMI context + * the architecture must support a safe cmpxchg in NMI context. 
+ * For those architectures that do not have that, then it cannot ask + * for a deferred user space stack trace from an NMI context. If it + * does, then it will get -EINVAL. + */ +#if defined(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) +# define CAN_USE_IN_NMI 1 +static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt) +{ + u32 old =3D 0; + + return try_cmpxchg(&info->id.cnt, &old, cnt); +} +#else +# define CAN_USE_IN_NMI 0 +/* When NMIs are not allowed, this always succeeds */ +static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt) +{ + info->id.cnt =3D cnt; + return true; +} +#endif + /* Make the cache fit in a 4K page */ #define UNWIND_MAX_ENTRIES \ ((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long)) @@ -42,14 +67,13 @@ static DEFINE_PER_CPU(u32, unwind_ctx_ctr); static u64 get_cookie(struct unwind_task_info *info) { u32 cnt =3D 1; - u32 old =3D 0; =20 if (info->id.cpu) return info->id.id; =20 /* LSB is always set to ensure 0 is an invalid value */ cnt |=3D __this_cpu_read(unwind_ctx_ctr) + 2; - if (try_cmpxchg(&info->id.cnt, &old, cnt)) { + if (try_assign_cnt(info, cnt)) { /* Update the per cpu counter */ __this_cpu_write(unwind_ctx_ctr, cnt); } @@ -167,31 +191,43 @@ static void unwind_deferred_task_work(struct callback= _head *head) int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { struct unwind_task_info *info =3D ¤t->unwind_info; + long pending; int ret; =20 *cookie =3D 0; =20 - if (WARN_ON_ONCE(in_nmi())) - return -EINVAL; - if ((current->flags & (PF_KTHREAD | PF_EXITING)) || !user_mode(task_pt_regs(current))) return -EINVAL; =20 + /* + * NMI requires having safe cmpxchg operations. + * Trigger a warning to make it obvious that an architecture + * is using this in NMI when it should not be. + */ + if (WARN_ON_ONCE(!CAN_USE_IN_NMI && in_nmi())) + return -EINVAL; + guard(irqsave)(); =20 *cookie =3D get_cookie(info); =20 /* callback already pending? 
 */
-	if (info->pending)
+	pending = READ_ONCE(info->pending);
+	if (pending)
+		return 1;
+
+	/* Claim the work unless an NMI just now swooped in to do so. */
+	if (!try_cmpxchg(&info->pending, &pending, 1))
 		return 1;
 
 	/* The work has been claimed, now schedule it. */
 	ret = task_work_add(current, &info->work, TWA_RESUME);
-	if (WARN_ON_ONCE(ret))
+	if (WARN_ON_ONCE(ret)) {
+		WRITE_ONCE(info->pending, 0);
 		return ret;
+	}
 
-	info->pending = 1;
 	return 0;
 }
 
-- 
2.47.2

From nobody Mon Oct  6 01:26:43 2025
Message-ID: <20250726141224.794538988@kernel.org>
User-Agent: quilt/0.68
Date: Sat, 26 Jul 2025 10:07:10 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Subject: [for-next][PATCH 06/10] unwind deferred: Use bitmask to determine which callbacks to call
References: <20250726140704.560579628@kernel.org>

From: Steven Rostedt

In order to know which registered callback requested a stacktrace for when the task goes back to user space, add a bitmask to keep track of all registered tracers. The bitmask is the size of long, which means that on a 32 bit machine, it can have at most 32 registered tracers, and on 64 bit, it can have at most 64 registered tracers.
This should not be an issue as there should not be more than 10 (unless BPF can abuse this?).

When a tracer registers with unwind_deferred_init() it will get a bit number assigned to it. When a tracer requests a stacktrace, its bit is set within the task_struct. When the task returns to user space, it will call the callbacks of all the registered tracers whose bits are set in the task's mask.

When a tracer is removed by unwind_deferred_cancel(), all current tasks will clear the associated bit, just in case another tracer gets registered immediately afterward and would otherwise get its callback called unexpectedly.

To prevent live locks, where an event that triggers between the task_work and the task going back to user space requests another deferred unwind, have the unwind_mask get cleared on exit to user space and not after the callback is made.

Move the pending bit from a value on the task_struct to bit zero of the unwind_mask (saves space on the task_struct). This will allow modifying the pending bit along with the work bits atomically.

Instead of clearing a work's bit after its callback is called, it is delayed until exit. If the work is requested again, the task_work is not queued again and the request will be notified that the task has already been called by returning a positive number (the same as if it was already pending). The pending bit is cleared before calling the callback functions but the current work bits remain. If one of the called works registers again, it will not trigger a task_work if its bit is still present in the task's unwind_mask. If a new work requests a deferred unwind, then it will set both the pending bit and its own bit. Note this will also cause any work that was previously queued and had its callback already executed to be executed again. Future work will remove these spurious callbacks.
The use of atomic_long bit operations was suggested by Peter Zijlstra:

Link: https://lore.kernel.org/all/20250715102912.GQ1613200@noisy.programming.kicks-ass.net/

The unwind_mask could not be converted to atomic_long_t due to atomic_long not having all the bit operations needed by unwind_mask. Instead it follows other use cases in the kernel and just typecasts the unwind_mask to atomic_long_t when using the two atomic_long functions.

Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Josh Poimboeuf
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E. Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185740.077441188@kernel.org
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/unwind_deferred.h       | 26 +++++++-
 include/linux/unwind_deferred_types.h |  2 +-
 kernel/unwind/deferred.c              | 87 +++++++++++++++++++------
 3 files changed, 92 insertions(+), 23 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 14efd8c027aa..337ead927d4d 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -13,10 +13,19 @@ typedef void (*unwind_callback_t)(struct unwind_work *work, struct unwind_stackt
 struct unwind_work {
 	struct list_head list;
 	unwind_callback_t func;
+	int bit;
 };
 
 #ifdef CONFIG_UNWIND_USER
 
+enum {
+	UNWIND_PENDING_BIT	= 0,
+};
+
+enum {
+	UNWIND_PENDING		= BIT(UNWIND_PENDING_BIT),
+};
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
@@ -28,15 +37,26 @@ void unwind_deferred_cancel(struct unwind_work *work);
 
 static __always_inline void unwind_reset_info(void)
 {
-	if (unlikely(current->unwind_info.id.id))
+	struct unwind_task_info *info = &current->unwind_info;
+	unsigned long bits;
+
+	/* Was there any
unwinding? */ + if (unlikely(info->unwind_mask)) { + bits =3D info->unwind_mask; + do { + /* Is a task_work going to run again before going back */ + if (bits & UNWIND_PENDING) + return; + } while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL)); current->unwind_info.id.id =3D 0; + } /* * As unwind_user_faultable() can be called directly and * depends on nr_entries being cleared on exit to user, * this needs to be a separate conditional. */ - if (unlikely(current->unwind_info.cache)) - current->unwind_info.cache->nr_entries =3D 0; + if (unlikely(info->cache)) + info->cache->nr_entries =3D 0; } =20 #else /* !CONFIG_UNWIND_USER */ diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_d= eferred_types.h index 104c477d5609..5dc9cda141ff 100644 --- a/include/linux/unwind_deferred_types.h +++ b/include/linux/unwind_deferred_types.h @@ -29,10 +29,10 @@ union unwind_task_id { }; =20 struct unwind_task_info { + unsigned long unwind_mask; struct unwind_cache *cache; struct callback_head work; union unwind_task_id id; - int pending; }; =20 #endif /* _LINUX_UNWIND_USER_DEFERRED_TYPES_H */ diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c index c5ac087d2396..e19f02ef416d 100644 --- a/kernel/unwind/deferred.c +++ b/kernel/unwind/deferred.c @@ -45,6 +45,16 @@ static inline bool try_assign_cnt(struct unwind_task_inf= o *info, u32 cnt) static DEFINE_MUTEX(callback_mutex); static LIST_HEAD(callbacks); =20 +#define RESERVED_BITS (UNWIND_PENDING) + +/* Zero'd bits are available for assigning callback users */ +static unsigned long unwind_mask =3D RESERVED_BITS; + +static inline bool unwind_pending(struct unwind_task_info *info) +{ + return test_bit(UNWIND_PENDING_BIT, &info->unwind_mask); +} + /* * This is a unique percpu identifier for a given task entry context. 
* Conceptually, it's incremented every time the CPU enters the kernel from @@ -138,14 +148,15 @@ static void unwind_deferred_task_work(struct callback= _head *head) struct unwind_task_info *info =3D container_of(head, struct unwind_task_i= nfo, work); struct unwind_stacktrace trace; struct unwind_work *work; + unsigned long bits; u64 cookie; =20 - if (WARN_ON_ONCE(!info->pending)) + if (WARN_ON_ONCE(!unwind_pending(info))) return; =20 - /* Allow work to come in again */ - WRITE_ONCE(info->pending, 0); - + /* Clear pending bit but make sure to have the current bits */ + bits =3D atomic_long_fetch_andnot(UNWIND_PENDING, + (atomic_long_t *)&info->unwind_mask); /* * From here on out, the callback must always be called, even if it's * just an empty trace. @@ -159,7 +170,8 @@ static void unwind_deferred_task_work(struct callback_h= ead *head) =20 guard(mutex)(&callback_mutex); list_for_each_entry(work, &callbacks, list) { - work->func(work, &trace, cookie); + if (test_bit(work->bit, &bits)) + work->func(work, &trace, cookie); } } =20 @@ -183,15 +195,16 @@ static void unwind_deferred_task_work(struct callback= _head *head) * because it has already been previously called for the same entry contex= t, * it will be called again with the same stack trace and cookie. * - * Return: 1 if the the callback was already queued. - * 0 if the callback successfully was queued. + * Return: 0 if the callback successfully was queued. + * 1 if the callback is pending or was already executed. * Negative if there's an error. * @cookie holds the cookie of the first request by any user */ int unwind_deferred_request(struct unwind_work *work, u64 *cookie) { struct unwind_task_info *info =3D ¤t->unwind_info; - long pending; + unsigned long old, bits; + unsigned long bit =3D BIT(work->bit); int ret; =20 *cookie =3D 0; @@ -212,32 +225,59 @@ int unwind_deferred_request(struct unwind_work *work,= u64 *cookie) =20 *cookie =3D get_cookie(info); =20 - /* callback already pending? 
*/ - pending =3D READ_ONCE(info->pending); - if (pending) - return 1; + old =3D READ_ONCE(info->unwind_mask); =20 - /* Claim the work unless an NMI just now swooped in to do so. */ - if (!try_cmpxchg(&info->pending, &pending, 1)) + /* Is this already queued or executed */ + if (old & bit) return 1; =20 + /* + * This work's bit hasn't been set yet. Now set it with the PENDING + * bit and fetch the current value of unwind_mask. If ether the + * work's bit or PENDING was already set, then this is already queued + * to have a callback. + */ + bits =3D UNWIND_PENDING | bit; + old =3D atomic_long_fetch_or(bits, (atomic_long_t *)&info->unwind_mask); + if (old & bits) { + /* + * If the work's bit was set, whatever set it had better + * have also set pending and queued a callback. + */ + WARN_ON_ONCE(!(old & UNWIND_PENDING)); + return old & bit; + } + /* The work has been claimed, now schedule it. */ ret =3D task_work_add(current, &info->work, TWA_RESUME); - if (WARN_ON_ONCE(ret)) { - WRITE_ONCE(info->pending, 0); - return ret; - } =20 - return 0; + if (WARN_ON_ONCE(ret)) + WRITE_ONCE(info->unwind_mask, 0); + + return ret; } =20 void unwind_deferred_cancel(struct unwind_work *work) { + struct task_struct *g, *t; + if (!work) return; =20 + /* No work should be using a reserved bit */ + if (WARN_ON_ONCE(BIT(work->bit) & RESERVED_BITS)) + return; + guard(mutex)(&callback_mutex); list_del(&work->list); + + __clear_bit(work->bit, &unwind_mask); + + guard(rcu)(); + /* Clear this bit from all threads */ + for_each_process_thread(g, t) { + clear_bit(work->bit, &t->unwind_info.unwind_mask); + } } =20 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func) @@ -245,6 +285,14 @@ int unwind_deferred_init(struct unwind_work *work, unw= ind_callback_t func) memset(work, 0, sizeof(*work)); =20 guard(mutex)(&callback_mutex); + + /* See if there's a bit in the mask available */ + if (unwind_mask =3D=3D ~0UL) + return -EBUSY; + + work->bit =3D ffz(unwind_mask); + 
__set_bit(work->bit, &unwind_mask); + list_add(&work->list, &callbacks); work->func =3D func; return 0; @@ -256,6 +304,7 @@ void unwind_task_init(struct task_struct *task) =20 memset(info, 0, sizeof(*info)); init_task_work(&info->work, unwind_deferred_task_work); + info->unwind_mask =3D 0; } =20 void unwind_task_free(struct task_struct *task) --=20 2.47.2 From nobody Mon Oct 6 01:26:43 2025 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 77E362857F6 for ; Sat, 26 Jul 2025 14:12:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1753539137; cv=none; b=bE2+AJmNHPMIRgUtGfcWI+tIeGzBDeOvfhQuQIKpPDM8w2vCa3qXY9ZuLwDmANiAZK7sc812nclJ5JioTGw9YFEfB0gFakPwacN2i+SZZX0+2J8Cj8WkUvOUPebStzb4yeDzChPA5TxmCFCcN2GOAU1tH7g0WU1Aapz/YERQKUk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1753539137; c=relaxed/simple; bh=zxeIOtfqBO0cjROJjLLehq1s2dLbW3VEVHSE+Bl8A2s=; h=Message-ID:Date:From:To:Cc:Subject:References:MIME-Version: Content-Type; b=fTwMjqW9dlSJErc/eddj4qwzTlvdhD+sUsQmACvkJDrAmdeCR4bSXp2CwDNL/xhPk4CmwjC/odwlV6Po1HprBFz9BT/IcjN3rQCfJOs2bGpR1pSEoE+MPKmuOCSfDwQ9rfKekxdpD0FlGCugq9UmQpW6UxXsPHD3iVsRKfl5F1U= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=kBHChjKM; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="kBHChjKM" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 53B16C4CEF6; Sat, 26 Jul 2025 14:12:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1753539137; 
Message-ID: <20250726141224.964201094@kernel.org>
Date: Sat, 26 Jul 2025 10:07:11 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Subject: [for-next][PATCH 07/10] unwind deferred: Add unwind_completed mask to stop spurious callbacks
References: <20250726140704.560579628@kernel.org>

From: Steven Rostedt

If more than one tracer is registered with the unwind deferred
infrastructure, one tracer can currently cause extra callbacks to fire for
another tracer: the former requests a deferred stacktrace after the
latter's callback has executed, but before the task returns to user space.
Here's an example of how this could occur:

 [Task enters kernel]
   tracer 1 request -> add cookie to its buffer
   tracer 1 request -> add cookie to its buffer
   <..>
   [ task work executes ]
   tracer 1 callback -> add trace + cookie to its buffer
   [tracer 2 requests and triggers the task work again]
   [ task work executes again ]
   tracer 1 callback -> add trace + cookie to its buffer
   tracer 2 callback -> add trace + cookie to its buffer
 [Task exits back to user space]

This happens because tracer 1's bit is set in the task's unwind_mask when
it makes its request, and is not cleared until the task returns to user
space. If another tracer then requests a deferred stacktrace, the next
task work will execute the callbacks of all tracers whose bits are set in
the task's unwind_mask.

To fix this, add another mask called unwind_completed and place it in the
task's info->cache structure. The cache structure is allocated on the
first occurrence of a deferred stacktrace, and the unwind_completed mask
is not needed until then; it is better to keep it in the cache than to
permanently waste space in the task_struct.

After a tracer's callback executes, its bit is set in this
unwind_completed mask. When the task_work runs, it ANDs the task's
unwind_mask with the inverse of unwind_completed, which eliminates any
work whose callback has already executed since the task entered the
kernel. When the task leaves the kernel, it resets the unwind_completed
mask along with the other values as it returns to user space.

Link: https://lore.kernel.org/all/20250716142609.47f0e4a5@batman.local.home/
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Josh Poimboeuf
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E.
Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185740.245440579@kernel.org
Link: https://lore.kernel.org/20250717004957.580552530@kernel.org
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/unwind_deferred.h       |  4 +++-
 include/linux/unwind_deferred_types.h |  1 +
 kernel/unwind/deferred.c              | 19 +++++++++++++++----
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 337ead927d4d..b9ec4c8515c7 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -55,8 +55,10 @@ static __always_inline void unwind_reset_info(void)
 	 * depends on nr_entries being cleared on exit to user,
 	 * this needs to be a separate conditional.
 	 */
-	if (unlikely(info->cache))
+	if (unlikely(info->cache)) {
 		info->cache->nr_entries = 0;
+		info->cache->unwind_completed = 0;
+	}
 }

 #else /* !CONFIG_UNWIND_USER */
diff --git a/include/linux/unwind_deferred_types.h b/include/linux/unwind_deferred_types.h
index 5dc9cda141ff..33b62ac25c86 100644
--- a/include/linux/unwind_deferred_types.h
+++ b/include/linux/unwind_deferred_types.h
@@ -3,6 +3,7 @@
 #define _LINUX_UNWIND_USER_DEFERRED_TYPES_H

 struct unwind_cache {
+	unsigned long unwind_completed;
 	unsigned int nr_entries;
 	unsigned long entries[];
 };
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index e19f02ef416d..a3d26014a2e6 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -166,12 +166,18 @@ static void unwind_deferred_task_work(struct callback_head *head)

 	unwind_user_faultable(&trace);

+	if (info->cache)
+		bits &= ~(info->cache->unwind_completed);
+
 	cookie = info->id.id;

 	guard(mutex)(&callback_mutex);
 	list_for_each_entry(work, &callbacks, list) {
-		if (test_bit(work->bit, &bits))
+		if (test_bit(work->bit, &bits)) {
 			work->func(work, &trace, cookie);
+			if (info->cache)
+				info->cache->unwind_completed |= BIT(work->bit);
+		}
 	}
 }

@@ -260,23 +266,28 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 void unwind_deferred_cancel(struct unwind_work *work)
 {
 	struct task_struct *g, *t;
+	int bit;

 	if (!work)
 		return;

+	bit = work->bit;
+
 	/* No work should be using a reserved bit */
-	if (WARN_ON_ONCE(BIT(work->bit) & RESERVED_BITS))
+	if (WARN_ON_ONCE(BIT(bit) & RESERVED_BITS))
 		return;

 	guard(mutex)(&callback_mutex);
 	list_del(&work->list);

-	__clear_bit(work->bit, &unwind_mask);
+	__clear_bit(bit, &unwind_mask);

 	guard(rcu)();
 	/* Clear this bit from all threads */
 	for_each_process_thread(g, t) {
-		clear_bit(work->bit, &t->unwind_info.unwind_mask);
+		clear_bit(bit, &t->unwind_info.unwind_mask);
+		if (t->unwind_info.cache)
+			clear_bit(bit, &t->unwind_info.cache->unwind_completed);
 	}
 }

-- 
2.47.2
Message-ID: <20250726141225.129577701@kernel.org>
Date: Sat, 26 Jul 2025 10:07:12 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Subject: [for-next][PATCH 08/10] unwind: Add USED bit to only have one conditional on way back to user space
References: <20250726140704.560579628@kernel.org>

From: Steven Rostedt

On the way back to user space, the function unwind_reset_info() is called
unconditionally (and is always inlined). It currently contains two
conditionals. One checks the unwind_mask, which is set whenever a deferred
trace is requested, and is used to know that the mask needs to be cleared.
The other checks if the cache has been allocated, and if so resets
nr_entries so that the unwinder knows it needs to do the work to get a new
user space stack trace again (it only does that work once per entry into
the kernel).

Use one of the bits in the unwind_mask as a "USED" bit that gets set
whenever a trace is created. This makes it possible to check only the
unwind_mask in unwind_reset_info() to know whether there is work to do,
and eliminates a conditional that runs every time the task returns to
user space.

Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Josh Poimboeuf
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E.
Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185740.414087734@kernel.org
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/unwind_deferred.h | 18 +++++++++---------
 kernel/unwind/deferred.c        |  5 ++++-
 2 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index b9ec4c8515c7..2efbda01e959 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -20,10 +20,14 @@ struct unwind_work {

 enum {
 	UNWIND_PENDING_BIT = 0,
+	UNWIND_USED_BIT,
 };

 enum {
 	UNWIND_PENDING = BIT(UNWIND_PENDING_BIT),
+
+	/* Set if the unwinding was used (directly or deferred) */
+	UNWIND_USED = BIT(UNWIND_USED_BIT)
 };

 void unwind_task_init(struct task_struct *task);
@@ -49,15 +53,11 @@ static __always_inline void unwind_reset_info(void)
 			return;
 		} while (!try_cmpxchg(&info->unwind_mask, &bits, 0UL));
 		current->unwind_info.id.id = 0;
-	}
-	/*
-	 * As unwind_user_faultable() can be called directly and
-	 * depends on nr_entries being cleared on exit to user,
-	 * this needs to be a separate conditional.
-	 */
-	if (unlikely(info->cache)) {
-		info->cache->nr_entries = 0;
-		info->cache->unwind_completed = 0;
+
+		if (unlikely(info->cache)) {
+			info->cache->nr_entries = 0;
+			info->cache->unwind_completed = 0;
+		}
 	}
 }

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index a3d26014a2e6..2311b725d691 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -45,7 +45,7 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);

-#define RESERVED_BITS	(UNWIND_PENDING)
+#define RESERVED_BITS	(UNWIND_PENDING | UNWIND_USED)

 /* Zero'd bits are available for assigning callback users */
 static unsigned long unwind_mask = RESERVED_BITS;
@@ -140,6 +140,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)

 	cache->nr_entries = trace->nr;

+	/* Clear nr_entries on way back to user space */
+	set_bit(UNWIND_USED_BIT, &info->unwind_mask);
+
 	return 0;
 }

-- 
2.47.2
Message-ID: <20250726141225.297220502@kernel.org>
Date: Sat, 26 Jul 2025 10:07:13 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Subject: [for-next][PATCH 09/10] unwind deferred: Use SRCU unwind_deferred_task_work()
References: <20250726140704.560579628@kernel.org>

From: Steven Rostedt

Instead of using callback_mutex to protect the list of callbacks in
unwind_deferred_task_work(), use SRCU. This function runs every time a
task that has a requested stack trace to record exits the kernel, which
can happen for many tasks on several CPUs at the same time. A mutex is a
bottleneck there and can cause contention that slows things down.

As the callbacks themselves are allowed to sleep, regular RCU cannot be
used to protect the list. Use SRCU instead, as it still allows the
callbacks to sleep while letting the list be read without holding
callback_mutex.

Link: https://lore.kernel.org/all/ca9bd83a-6c80-4ee0-a83c-224b9d60b755@efficios.com/
Cc: "Paul E. McKenney"
Cc: Masami Hiramatsu
Cc: Josh Poimboeuf
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E.
Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185740.581435592@kernel.org
Suggested-by: Mathieu Desnoyers
Signed-off-by: Steven Rostedt (Google)
---
 kernel/unwind/deferred.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index 2311b725d691..a5ef1c1f915e 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -41,7 +41,7 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
 #define UNWIND_MAX_ENTRIES					\
 	((SZ_4K - sizeof(struct unwind_cache)) / sizeof(long))

-/* Guards adding to and reading the list of callbacks */
+/* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);

@@ -49,6 +49,7 @@ static LIST_HEAD(callbacks);

 /* Zero'd bits are available for assigning callback users */
 static unsigned long unwind_mask = RESERVED_BITS;
+DEFINE_STATIC_SRCU(unwind_srcu);

 static inline bool unwind_pending(struct unwind_task_info *info)
 {
@@ -174,8 +175,9 @@ static void unwind_deferred_task_work(struct callback_head *head)

 	cookie = info->id.id;

-	guard(mutex)(&callback_mutex);
-	list_for_each_entry(work, &callbacks, list) {
+	guard(srcu)(&unwind_srcu);
+	list_for_each_entry_srcu(work, &callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
 		if (test_bit(work->bit, &bits)) {
 			work->func(work, &trace, cookie);
 			if (info->cache)
@@ -213,7 +215,7 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	unsigned long old, bits;
-	unsigned long bit = BIT(work->bit);
+	unsigned long bit;
 	int ret;

 	*cookie = 0;
@@ -230,6 +232,14 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 	if (WARN_ON_ONCE(!CAN_USE_IN_NMI && in_nmi()))
 		return -EINVAL;

+	/* Do not allow cancelled works to request again */
+	bit = READ_ONCE(work->bit);
+	if (WARN_ON_ONCE(bit < 0))
+		return -EINVAL;
+
+	/* Only need the mask now */
+	bit = BIT(bit);
+
 	guard(irqsave)();

 	*cookie = get_cookie(info);
@@ -281,10 +291,15 @@ void unwind_deferred_cancel(struct unwind_work *work)
 		return;

 	guard(mutex)(&callback_mutex);
-	list_del(&work->list);
+	list_del_rcu(&work->list);
+
+	/* Do not allow any more requests and prevent callbacks */
+	work->bit = -1;

 	__clear_bit(bit, &unwind_mask);

+	synchronize_srcu(&unwind_srcu);
+
 	guard(rcu)();
 	/* Clear this bit from all threads */
 	for_each_process_thread(g, t) {
@@ -307,7 +322,7 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	work->bit = ffz(unwind_mask);
 	__set_bit(work->bit, &unwind_mask);

-	list_add(&work->list, &callbacks);
+	list_add_rcu(&work->list, &callbacks);
 	work->func = func;
 	return 0;
 }
-- 
2.47.2
Message-ID: <20250726141225.470646928@kernel.org>
Date: Sat, 26 Jul 2025 10:07:14 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org
Subject: [for-next][PATCH 10/10] unwind: Finish up unwind when a task exits
References: <20250726140704.560579628@kernel.org>

From: Steven Rostedt

In do_exit(), when a task is exiting, if an unwind has been requested and
the user stacktrace is deferred via the task_work, the task_work callback
runs after exit_mm() has been called. This means the user stack trace can
no longer be retrieved and an empty stack is recorded.

Instead, add a function unwind_deferred_task_exit() and call it just
before exit_mm() so that the unwinder can invoke the requested callbacks
while the user space stack is still available.

Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Josh Poimboeuf
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Arnaldo Carvalho de Melo
Cc: Namhyung Kim
Cc: Thomas Gleixner
Cc: Andrii Nakryiko
Cc: Indu Bhagat
Cc: "Jose E.
Marchesi"
Cc: Beau Belgrave
Cc: Jens Remus
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Jens Axboe
Cc: Florian Weimer
Cc: Sam James
Link: https://lore.kernel.org/20250725185740.748555530@kernel.org
Signed-off-by: Steven Rostedt (Google)
---
 include/linux/unwind_deferred.h |  3 +++
 kernel/exit.c                   |  2 ++
 kernel/unwind/deferred.c        | 23 ++++++++++++++++++++---
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index 2efbda01e959..26122d00708a 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -39,6 +39,8 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
 void unwind_deferred_cancel(struct unwind_work *work);

+void unwind_deferred_task_exit(struct task_struct *task);
+
 static __always_inline void unwind_reset_info(void)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -71,6 +73,7 @@ static inline int unwind_deferred_init(struct unwind_work *work, unwind_callback
 static inline int unwind_deferred_request(struct unwind_work *work, u64 *timestamp) { return -ENOSYS; }
 static inline void unwind_deferred_cancel(struct unwind_work *work) {}

+static inline void unwind_deferred_task_exit(struct task_struct *task) {}
 static inline void unwind_reset_info(void) {}

 #endif /* !CONFIG_UNWIND_USER */
diff --git a/kernel/exit.c b/kernel/exit.c
index bb184a67ac73..1d8c8ac33c4f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -68,6 +68,7 @@
 #include
 #include
 #include
+#include <linux/unwind_deferred.h>
 #include
 #include

@@ -938,6 +939,7 @@ void __noreturn do_exit(long code)

 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
+	unwind_deferred_task_exit(tsk);
 	trace_sched_process_exit(tsk, group_dead);

 	/*
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index a5ef1c1f915e..dc6040aae3ee 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -114,7 +114,7 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
 	/* Should always be called from faultable context */
 	might_fault();

-	if (current->flags & PF_EXITING)
+	if (!current->mm)
 		return -EINVAL;

 	if (!info->cache) {
@@ -147,9 +147,9 @@ int unwind_user_faultable(struct unwind_stacktrace *trace)
 	return 0;
 }

-static void unwind_deferred_task_work(struct callback_head *head)
+static void process_unwind_deferred(struct task_struct *task)
 {
-	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct unwind_task_info *info = &task->unwind_info;
 	struct unwind_stacktrace trace;
 	struct unwind_work *work;
 	unsigned long bits;
@@ -186,6 +186,23 @@ static void unwind_deferred_task_work(struct callback_head *head)
 	}
 }

+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	process_unwind_deferred(current);
+}
+
+void unwind_deferred_task_exit(struct task_struct *task)
+{
+	struct unwind_task_info *info = &current->unwind_info;
+
+	if (!unwind_pending(info))
+		return;
+
+	process_unwind_deferred(task);
+
+	task_work_cancel(task, &info->work);
+}
+
 /**
  * unwind_deferred_request - Request a user stacktrace on task kernel exit
  * @work: Unwind descriptor requesting the trace
-- 
2.47.2