From: Josh Poimboeuf
To: x86@kernel.org
Cc: Peter Zijlstra, Steven Rostedt, Ingo Molnar, Arnaldo Carvalho de Melo,
    linux-kernel@vger.kernel.org, Indu Bhagat, Mark Rutland,
    Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
    linux-perf-users@vger.kernel.org, Mark Brown,
    linux-toolchains@vger.kernel.org, Jordan Rome, Sam James,
    linux-trace-kernel@vger.kernel.org, Andrii Nakryiko, Jens Remus,
    Mathieu Desnoyers, Florian Weimer, Andy Lutomirski
Subject: [PATCH v3 11/19] unwind: Add deferred user space unwinding API
Date: Mon, 28 Oct 2024 14:47:38 -0700

Add unwind_user_deferred() which allows callers to schedule task work to
unwind the user space stack before returning to user space.  This solves
several problems for its callers:

  - Ensure the unwind happens in task context even if the caller may be
    running in interrupt context.

  - Only do the unwind once, even if called multiple times either by the
    same caller or by multiple callers.

  - Create a "context cookie" which allows trace post-processing to
    correlate kernel unwinds/traces with the user unwind.
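For illustration only (not part of this patch; the my_* names and the data
pointer are hypothetical placeholders), a caller such as a tracer might use
the API roughly as follows:

  static struct unwind_callback my_cb;

  /* Runs from task work, shortly before the task returns to user space. */
  static void my_unwind_cb(struct unwind_stacktrace *trace, u64 ctx_cookie,
                           void *data)
  {
          /* consume trace->entries[0 .. trace->nr - 1], tagged with ctx_cookie */
  }

  /* Once, at initialization: */
  unwind_user_register(&my_cb, my_unwind_cb);

  /* From task or non-NMI interrupt context, e.g. when a sample is taken: */
  u64 cookie;

  if (!unwind_user_deferred(&my_cb, &cookie, my_data)) {
          /* record 'cookie' in the kernel-side sample for post-processing */
  }

Repeated calls before the next return to user space reuse the same cookie
and result in a single callback invocation.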
Signed-off-by: Josh Poimboeuf
---
 include/linux/entry-common.h |   3 +
 include/linux/sched.h        |   5 +
 include/linux/unwind_user.h  |  56 ++++++++++
 kernel/fork.c                |   4 +
 kernel/unwind/user.c         | 199 +++++++++++++++++++++++++++++++++++
 5 files changed, 267 insertions(+)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 1e50cdb83ae5..efbe8f964f31 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include <linux/unwind_user.h>

 #include

@@ -111,6 +112,8 @@ static __always_inline void enter_from_user_mode(struct pt_regs *regs)
 	CT_WARN_ON(__ct_state() != CT_STATE_USER);
 	user_exit_irqoff();

+	unwind_enter_from_user_mode();
+
 	instrumentation_begin();
 	kmsan_unpoison_entry_regs(regs);
 	trace_hardirqs_off_finish();
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5007a8e2d640..31b6f1d763ef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -47,6 +47,7 @@
 #include
 #include
 #include
+#include <linux/unwind_user.h>

 /* task_struct member predeclarations (sorted alphabetically): */
 struct audit_context;
@@ -1592,6 +1593,10 @@ struct task_struct {
 	struct user_event_mm *user_event_mm;
 #endif

+#ifdef CONFIG_UNWIND_USER
+	struct unwind_task_info unwind_task_info;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/include/linux/unwind_user.h b/include/linux/unwind_user.h
index cde0fde4923e..98e236c843b1 100644
--- a/include/linux/unwind_user.h
+++ b/include/linux/unwind_user.h
@@ -3,6 +3,9 @@
 #define _LINUX_UNWIND_USER_H

 #include
+#include <linux/percpu.h>
+
+#define UNWIND_MAX_CALLBACKS 4

 enum unwind_user_type {
 	UNWIND_USER_TYPE_NONE,
@@ -30,6 +33,26 @@ struct unwind_user_state {
 	bool done;
 };

+struct unwind_task_info {
+	u64 ctx_cookie;
+	u32 pending_callbacks;
+	u64 last_cookies[UNWIND_MAX_CALLBACKS];
+	void *privs[UNWIND_MAX_CALLBACKS];
+	unsigned long *entries;
+	struct callback_head work;
+};
+
+typedef void (*unwind_callback_t)(struct unwind_stacktrace *trace,
+				  u64 ctx_cookie, void *data);
+
+struct unwind_callback {
+	unwind_callback_t func;
+	int idx;
+};
+
+
+#ifdef CONFIG_UNWIND_USER
+
 /* Synchronous interfaces: */

 int unwind_user_start(struct unwind_user_state *state);
@@ -40,4 +63,37 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries);
 #define for_each_user_frame(state) \
 	for (unwind_user_start((state)); !(state)->done; unwind_user_next((state)))

+
+/* Asynchronous interfaces: */
+
+void unwind_task_init(struct task_struct *task);
+void unwind_task_free(struct task_struct *task);
+
+int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func);
+int unwind_user_unregister(struct unwind_callback *callback);
+
+int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data);
+
+DECLARE_PER_CPU(u64, unwind_ctx_ctr);
+
+static __always_inline void unwind_enter_from_user_mode(void)
+{
+	__this_cpu_inc(unwind_ctx_ctr);
+}
+
+
+#else /* !CONFIG_UNWIND_USER */
+
+static inline void unwind_task_init(struct task_struct *task) {}
+static inline void unwind_task_free(struct task_struct *task) {}
+
+static inline int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func) { return -ENOSYS; }
+static inline int unwind_user_unregister(struct unwind_callback *callback) { return -ENOSYS; }
+
+static inline int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data) { return -ENOSYS; }
+
+static inline void unwind_enter_from_user_mode(void) {}
+
+#endif /* !CONFIG_UNWIND_USER */
+
 #endif /* _LINUX_UNWIND_USER_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 60f14fbab956..d7580067853d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -105,6 +105,7 @@
 #include
 #include
 #include
+#include <linux/unwind_user.h>
 #include

 #include
@@ -972,6 +973,7 @@ void __put_task_struct(struct task_struct *tsk)
 	WARN_ON(refcount_read(&tsk->usage));
 	WARN_ON(tsk == current);

+	unwind_task_free(tsk);
 	sched_ext_free(tsk);
 	io_uring_free(tsk);
 	cgroup_free(tsk);
@@ -2348,6 +2350,8 @@ __latent_entropy struct task_struct *copy_process(
 	p->bpf_ctx = NULL;
 #endif

+	unwind_task_init(p);
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
 	if (retval)
diff --git a/kernel/unwind/user.c b/kernel/unwind/user.c
index 8e47c80e3e54..ed7759c56551 100644
--- a/kernel/unwind/user.c
+++ b/kernel/unwind/user.c
@@ -10,6 +10,11 @@
 #include
 #include
 #include
+#include <linux/slab.h>
+#include <linux/task_work.h>
+#include <linux/percpu.h>
+
+#define UNWIND_MAX_ENTRIES 512

 #ifdef CONFIG_HAVE_UNWIND_USER_FP
 #include
@@ -20,6 +25,12 @@ static struct unwind_user_frame fp_frame = {
 static struct unwind_user_frame fp_frame;
 #endif

+static struct unwind_callback *callbacks[UNWIND_MAX_CALLBACKS];
+static DECLARE_RWSEM(callbacks_rwsem);
+
+/* Counter for entries from user space */
+DEFINE_PER_CPU(u64, unwind_ctx_ctr);
+
 int unwind_user_next(struct unwind_user_state *state)
 {
 	struct unwind_user_frame _frame;
@@ -117,3 +128,191 @@ int unwind_user(struct unwind_stacktrace *trace, unsigned int max_entries)

 	return 0;
 }
+
+/*
+ * The "context cookie" is a unique identifier which allows post-processing to
+ * correlate kernel trace(s) with user unwinds.  It has the CPU id in the
+ * highest 16 bits and a per-CPU entry counter in the lower 48 bits.
+ */
+static u64 ctx_to_cookie(u64 cpu, u64 ctx)
+{
+	BUILD_BUG_ON(NR_CPUS > 65535);
+	return (ctx & ((1UL << 48) - 1)) | (cpu << 48);
+}
+
+/*
+ * Schedule a user space unwind to be done in task work before exiting the
+ * kernel.
+ *
+ * The @callback must have previously been registered with
+ * unwind_user_register().
+ *
+ * The @ctx_cookie output is a unique identifier which will also be passed to
+ * the callback function.  It can be used to stitch kernel and user traces
+ * together in post-processing.
+ *
+ * If there are multiple calls to this function for a given @callback, the
+ * cookie will usually be the same and the callback will only be called once.
+ *
+ * The only exception is when the task has migrated to another CPU, *and* this
+ * is called while the task work is running (or has already run).  Then a new
+ * cookie will be generated and the callback will be called again for the new
+ * cookie.
+ */
+int unwind_user_deferred(struct unwind_callback *callback, u64 *ctx_cookie, void *data)
+{
+	struct unwind_task_info *info = &current->unwind_task_info;
+	u64 cookie = info->ctx_cookie;
+	int idx = callback->idx;
+
+	if (WARN_ON_ONCE(in_nmi()))
+		return -EINVAL;
+
+	if (WARN_ON_ONCE(!callback->func || idx < 0))
+		return -EINVAL;
+
+	if (!current->mm)
+		return -EINVAL;
+
+	guard(irqsave)();
+
+	if (cookie && (info->pending_callbacks & (1 << idx)))
+		goto done;
+
+	/*
+	 * If this is the first call from *any* caller since the most recent
+	 * entry from user space, initialize the task context cookie and
+	 * schedule the task work.
+	 */
+	if (!cookie) {
+		u64 ctx_ctr = __this_cpu_read(unwind_ctx_ctr);
+		u64 cpu = raw_smp_processor_id();
+
+		cookie = ctx_to_cookie(cpu, ctx_ctr);
+
+		/*
+		 * If called after task work has sent an unwind to the callback
+		 * function but before the exit to user space, skip it as the
+		 * previous call to the callback function should suffice.
+		 *
+		 * The only exception is if this task has migrated to another
+		 * CPU since the first call to unwind_user_deferred().  The
+		 * per-CPU context counter will have changed which will result
+		 * in a new cookie and another unwind (see comment above
+		 * function).
+		 */
+		if (cookie == info->last_cookies[idx])
+			goto done;
+
+		info->ctx_cookie = cookie;
+		task_work_add(current, &info->work, TWA_RESUME);
+	}
+
+	info->pending_callbacks |= (1 << idx);
+	info->privs[idx] = data;
+	info->last_cookies[idx] = cookie;
+
+done:
+	if (ctx_cookie)
+		*ctx_cookie = cookie;
+	return 0;
+}
+
+static void unwind_user_task_work(struct callback_head *head)
+{
+	struct unwind_task_info *info = container_of(head, struct unwind_task_info, work);
+	struct task_struct *task = container_of(info, struct task_struct, unwind_task_info);
+	void *privs[UNWIND_MAX_CALLBACKS];
+	struct unwind_stacktrace trace;
+	unsigned long pending;
+	u64 cookie = 0;
+	int i;
+
+	BUILD_BUG_ON(UNWIND_MAX_CALLBACKS > 32);
+
+	if (WARN_ON_ONCE(task != current))
+		return;
+
+	if (WARN_ON_ONCE(!info->ctx_cookie || !info->pending_callbacks))
+		return;
+
+	scoped_guard(irqsave) {
+		pending = info->pending_callbacks;
+		cookie = info->ctx_cookie;
+
+		info->pending_callbacks = 0;
+		info->ctx_cookie = 0;
+		memcpy(privs, info->privs, sizeof(void *) * UNWIND_MAX_CALLBACKS);
+	}
+
+	if (!info->entries) {
+		info->entries = kmalloc(UNWIND_MAX_ENTRIES * sizeof(long),
+					GFP_KERNEL);
+		if (!info->entries)
+			return;
+	}
+
+	trace.entries = info->entries;
+	trace.nr = 0;
+	unwind_user(&trace, UNWIND_MAX_ENTRIES);
+
+	guard(rwsem_read)(&callbacks_rwsem);
+
+	for_each_set_bit(i, &pending, UNWIND_MAX_CALLBACKS) {
+		if (callbacks[i])
+			callbacks[i]->func(&trace, cookie, privs[i]);
+	}
+}
+
+int unwind_user_register(struct unwind_callback *callback, unwind_callback_t func)
+{
+	scoped_guard(rwsem_write, &callbacks_rwsem) {
+		for (int i = 0; i < UNWIND_MAX_CALLBACKS; i++) {
+			if (!callbacks[i]) {
+				callback->func = func;
+				callback->idx = i;
+				callbacks[i] = callback;
+				return 0;
+			}
+		}
+	}
+
+	callback->func = NULL;
+	callback->idx = -1;
+	return -ENOSPC;
+}
+
+int unwind_user_unregister(struct unwind_callback *callback)
+{
+	if (callback->idx < 0)
+		return -EINVAL;
+
+	scoped_guard(rwsem_write, &callbacks_rwsem)
+		callbacks[callback->idx] = NULL;
+
+	callback->func = NULL;
+	callback->idx = -1;
+
+	return 0;
+}
+
+void unwind_task_init(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_task_info;
+
+	info->entries = NULL;
+	info->pending_callbacks = 0;
+	info->ctx_cookie = 0;
+
+	memset(info->last_cookies, 0, sizeof(u64) * UNWIND_MAX_CALLBACKS);
+	memset(info->privs, 0, sizeof(void *) * UNWIND_MAX_CALLBACKS);
+
+	init_task_work(&info->work, unwind_user_task_work);
+}
+
+void unwind_task_free(struct task_struct *task)
+{
+	struct unwind_task_info *info = &task->unwind_task_info;
+
+	kfree(info->entries);
+}
-- 
2.47.0
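A post-processing note, not part of the patch: given the ctx_to_cookie()
encoding above (CPU id in the top 16 bits, per-CPU entry counter in the low
48 bits), a consumer could split a context cookie back apart along these
lines (helper names are illustrative only):

  static inline u16 cookie_to_cpu(u64 cookie)
  {
          /* top 16 bits hold the CPU id */
          return cookie >> 48;
  }

  static inline u64 cookie_to_ctx(u64 cookie)
  {
          /* low 48 bits hold the per-CPU entry counter */
          return cookie & ((1ULL << 48) - 1);
  }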