Message-ID: <20251007214123.537465618@kernel.org>
User-Agent: quilt/0.68
Date: Tue, 07 Oct 2025 17:40:09 -0400
From: Steven Rostedt
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 bpf@vger.kernel.org, x86@kernel.org
Cc: Masami Hiramatsu, Mathieu Desnoyers, Josh Poimboeuf, Peter Zijlstra,
 Ingo Molnar, Jiri Olsa, Arnaldo Carvalho de Melo, Namhyung Kim,
 Thomas Gleixner, Andrii Nakryiko, Indu Bhagat, "Jose E. Marchesi",
 Beau Belgrave, Jens Remus, Linus Torvalds, Andrew Morton,
 Florian Weimer, Sam James, Kees Cook, "Carlos O'Donell"
Subject: [PATCH v16 1/4] unwind: Add interface to allow tracing a single task
References: <20251007214008.080852573@kernel.org>

From: Steven Rostedt

If a tracer (namely perf) is only tracing a single task, it does not need
the full functionality of the deferred stacktrace infrastructure. That
infrastructure can only support a limited number of users, because it has
to handle multiple tracers tracing multiple tasks at the same time, which
creates a many-to-many relationship.

A tracer that traces only a single task instead creates a one-to-many
relationship: the tracer traces a single task, while that task may still
have several tracers tracing it. This allows the needed data to be
allocated at initialization time, letting the tracer attach to the task
with its own task_work structure.

Add a new interface, unwind_deferred_task_init(), that works like
unwind_deferred_init() but is used when the tracer will only ever trace a
single task at a time. The unwind_work descriptor initialized by this
function now has a struct callback_head field that is used to attach the
descriptor to a task_work. The work->bit of such a descriptor is set to
UNWIND_PENDING_BIT (under the name UNWIND_TASK) to differentiate it from
unwind_works that can trace any task, whose work->bit is one of the
allocated bits in the unwind_mask.

The rest of the calls are the same. That is, unwind_deferred_request() and
unwind_deferred_cancel() are called on this descriptor, and those
functions know that it traces a single task. Even the callback function
works the same way.

If a tracer tries to use this kind of unwind_work on multiple tasks at the
same time, it simply fails to attach to the second task while a deferred
stacktrace is still pending on the first, and a WARN_ON is produced.

Signed-off-by: Steven Rostedt (Google)
---
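For reference, a minimal sketch of how a tracer that follows a single task
could use this interface (not part of the patch; the my_* names and the
struct my_tracer wrapper are made up for illustration, only the
unwind_deferred_*() calls and the callback signature come from this
series):

#include <linux/unwind_deferred.h>

struct my_tracer {
	struct unwind_work	unwind_work;
	/* per-tracer state ... */
};

/* Runs from task_work on return to user space with the user stacktrace */
static void my_unwind_callback(struct unwind_work *work,
			       struct unwind_stacktrace *trace, u64 cookie)
{
	/* Record trace->entries[0..trace->nr) and tie it to @cookie */
}

static int my_tracer_init(struct my_tracer *t)
{
	/* Single-task variant: does not consume a bit in the unwind_mask */
	return unwind_deferred_task_init(&t->unwind_work, my_unwind_callback);
}

/* Called from an event while the (single) traced task is current */
static void my_tracer_event(struct my_tracer *t)
{
	u64 cookie;

	/* 0: callback queued, 1: already pending, negative: error */
	unwind_deferred_request(&t->unwind_work, &cookie);
	/* Tag the kernel-side event with @cookie to match the stacktrace */
}

static void my_tracer_exit(struct my_tracer *t)
{
	/* Waits for any in-flight callback before returning */
	unwind_deferred_cancel(&t->unwind_work);
}
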
 include/linux/unwind_deferred.h |  15 ++
 kernel/unwind/deferred.c        | 283 +++++++++++++++++++++++++++-----
 2 files changed, 255 insertions(+), 43 deletions(-)

diff --git a/include/linux/unwind_deferred.h b/include/linux/unwind_deferred.h
index f4743c8cff4c..6f0f04ba538d 100644
--- a/include/linux/unwind_deferred.h
+++ b/include/linux/unwind_deferred.h
@@ -2,6 +2,7 @@
 #ifndef _LINUX_UNWIND_USER_DEFERRED_H
 #define _LINUX_UNWIND_USER_DEFERRED_H
 
+#include
 #include
 #include
 #include
@@ -15,6 +16,9 @@ typedef void (*unwind_callback_t)(struct unwind_work *work,
 struct unwind_work {
 	struct list_head	list;
 	unwind_callback_t	func;
+	struct callback_head	work;
+	struct task_struct	*task;
+	struct rcuwait		wait;
 	int			bit;
 };
 
@@ -32,11 +36,22 @@ enum {
 	UNWIND_USED	= BIT(UNWIND_USED_BIT)
 };
 
+/*
+ * UNWIND_PENDING is set in the task's info->unwind_mask when
+ * a deferred unwind is requested on that task. If the unwind
+ * descriptor is used only to trace a specific task, its bit
+ * is the UNWIND_PENDING_BIT. This gets set as the work->bit
+ * and is used to distinguish unwind_work descriptors that trace
+ * a single task from those that trace all tasks.
+ */
+#define UNWIND_TASK UNWIND_PENDING_BIT
+
 void unwind_task_init(struct task_struct *task);
 void unwind_task_free(struct task_struct *task);
 
 int unwind_user_faultable(struct unwind_stacktrace *trace);
 
+int unwind_deferred_task_init(struct unwind_work *work, unwind_callback_t func);
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func);
 int unwind_deferred_request(struct unwind_work *work, u64 *cookie);
 void unwind_deferred_cancel(struct unwind_work *work);
diff --git a/kernel/unwind/deferred.c b/kernel/unwind/deferred.c
index ceeeff562302..f34b60713a4b 100644
--- a/kernel/unwind/deferred.c
+++ b/kernel/unwind/deferred.c
@@ -44,6 +44,7 @@ static inline bool try_assign_cnt(struct unwind_task_info *info, u32 cnt)
 /* Guards adding to or removing from the list of callbacks */
 static DEFINE_MUTEX(callback_mutex);
 static LIST_HEAD(callbacks);
+static LIST_HEAD(task_callbacks);
 
 #define RESERVED_BITS	(UNWIND_PENDING | UNWIND_USED)
 
@@ -155,12 +156,18 @@ static void process_unwind_deferred(struct task_struct *task)
 	unsigned long bits;
 	u64 cookie;
 
-	if (WARN_ON_ONCE(!unwind_pending(info)))
-		return;
-
 	/* Clear pending bit but make sure to have the current bits */
 	bits = atomic_long_fetch_andnot(UNWIND_PENDING, &info->unwind_mask);
+
+	/* Remove the callbacks that were already completed */
+	if (info->cache)
+		bits &= ~(info->cache->unwind_completed);
+
+	/* If all callbacks have already been done, there's nothing to do */
+	if (!bits)
+		return;
+
 	/*
 	 * From here on out, the callback must always be called, even if it's
 	 * just an empty trace.
@@ -170,9 +177,6 @@ static void process_unwind_deferred(struct task_struct *task)
 
 	unwind_user_faultable(&trace);
 
-	if (info->cache)
-		bits &= ~(info->cache->unwind_completed);
-
 	cookie = info->id.id;
 
 	guard(srcu)(&unwind_srcu);
@@ -186,11 +190,95 @@
 	}
 }
 
-static void unwind_deferred_task_work(struct callback_head *head)
+/* Callback for an unwind work that traces all tasks */
+static void unwind_deferred_work(struct callback_head *head)
 {
 	process_unwind_deferred(current);
 }
 
+/* Get the trace for an unwind work that traces a single task */
+static void get_deferred_task_stacktrace(struct task_struct *task,
+					 struct unwind_stacktrace *trace,
+					 u64 *cookie, bool clear_pending)
+{
+	struct unwind_task_info *info = &task->unwind_info;
+
+	if (clear_pending)
+		atomic_long_andnot(UNWIND_PENDING, &info->unwind_mask);
+
+	trace->nr = 0;
+	trace->entries = NULL;
+
+	unwind_user_faultable(trace);
+
+	*cookie = info->id.id;
+}
+
+/* Callback for an unwind work that only traces this task */
+static void unwind_deferred_task_work(struct callback_head *head)
+{
+	struct unwind_work *work = container_of(head, struct unwind_work, work);
+	struct unwind_task_info *info = &current->unwind_info;
+	struct unwind_stacktrace trace;
+	u64 cookie;
+
+	guard(srcu)(&unwind_srcu);
+
+	/* Always clear the pending bit when this is called */
+	atomic_long_andnot(UNWIND_PENDING, &info->unwind_mask);
+
+	/* Is this work being canceled? */
+	if (unlikely(work->bit < 0))
+		work->task = NULL;
+
+	if (!work->task)
+		goto out;
+
+	/*
+	 * From here on out, the callback must always be called, even if it's
+	 * just an empty trace.
+	 */
+	get_deferred_task_stacktrace(current, &trace, &cookie, false);
+	work->func(work, &trace, cookie);
+	work->task = NULL;
+out:
+	/* Synchronize with cancel_unwind_task() */
+	rcuwait_wake_up(&work->wait);
+}
+
+/* Flush any pending work for an exiting task */
+static void process_unwind_tasks(struct task_struct *task)
+{
+	struct unwind_stacktrace trace;
+	struct unwind_work *work;
+	u64 cookie = 0;
+
+	guard(srcu)(&unwind_srcu);
+
+	/* The task is exiting, flush any pending per task unwind works */
+	list_for_each_entry_srcu(work, &task_callbacks, list,
+				 srcu_read_lock_held(&unwind_srcu)) {
+		if (work->task != task)
+			continue;
+
+		/* There may be waiters in cancel_unwind_task() */
+		if (work->bit < 0)
+			goto wakeup;
+
+		task_work_cancel(task, &work->work);
+
+		/* Only need to get the trace once */
+		if (!cookie)
+			get_deferred_task_stacktrace(task, &trace,
+						     &cookie, true);
+		work->func(work, &trace, cookie);
+wakeup:
+		work->task = NULL;
+		/* Synchronize with cancel_unwind_task() */
+		rcuwait_wake_up(&work->wait);
+	}
+}
+
 void unwind_deferred_task_exit(struct task_struct *task)
 {
 	struct unwind_task_info *info = &current->unwind_info;
@@ -199,10 +287,80 @@ void unwind_deferred_task_exit(struct task_struct *task)
 		return;
 
 	process_unwind_deferred(task);
+	process_unwind_tasks(task);
 
 	task_work_cancel(task, &info->work);
 }
 
+static int queue_unwind_task(struct unwind_work *work, int twa_mode,
+			     struct unwind_task_info *info)
+{
+	struct task_struct *task = READ_ONCE(work->task);
+	int ret;
+
+	if (task) {
+		/* Did the tracer break its contract? */
+		WARN_ON_ONCE(task != current);
+		return 1;
+	}
+
+	if (!try_cmpxchg(&work->task, &task, current))
+		return 1;
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &work->work, twa_mode);
+
+	if (WARN_ON_ONCE(ret))
+		work->task = NULL;
+	else
+		atomic_long_or(UNWIND_PENDING, &info->unwind_mask);
+
+	return ret;
+}
+
+static int queue_unwind_work(struct unwind_work *work, int twa_mode,
+			     struct unwind_task_info *info)
+{
+	unsigned long bit = BIT(work->bit);
+	unsigned long old, bits;
+	int ret;
+
+	/* Check if the unwind_work only traces this task */
+	if (work->bit == UNWIND_TASK)
+		return queue_unwind_task(work, twa_mode, info);
+
+	old = atomic_long_read(&info->unwind_mask);
+
+	/* Is this already queued or executed */
+	if (old & bit)
+		return 1;
+
+	/*
+	 * This work's bit hasn't been set yet. Now set it with the PENDING
+	 * bit and fetch the current value of unwind_mask. If either the
+	 * work's bit or PENDING was already set, then this is already queued
+	 * to have a callback.
+	 */
+	bits = UNWIND_PENDING | bit;
+	old = atomic_long_fetch_or(bits, &info->unwind_mask);
+	if (old & bits) {
+		/*
+		 * If the work's bit was set, whatever set it had better
+		 * have also set pending and queued a callback.
+		 */
+		WARN_ON_ONCE(!(old & UNWIND_PENDING));
+		return old & bit;
+	}
+
+	/* The work has been claimed, now schedule it. */
+	ret = task_work_add(current, &info->work, twa_mode);
+
+	if (WARN_ON_ONCE(ret))
+		atomic_long_set(&info->unwind_mask, 0);
+
+	return ret;
+}
+
 /**
  * unwind_deferred_request - Request a user stacktrace on task kernel exit
  * @work: Unwind descriptor requesting the trace
@@ -232,9 +390,6 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 {
 	struct unwind_task_info *info = &current->unwind_info;
 	int twa_mode = TWA_RESUME;
-	unsigned long old, bits;
-	unsigned long bit;
-	int ret;
 
 	*cookie = 0;
 
@@ -254,47 +409,45 @@ int unwind_deferred_request(struct unwind_work *work, u64 *cookie)
 	}
 
 	/* Do not allow cancelled works to request again */
-	bit = READ_ONCE(work->bit);
-	if (WARN_ON_ONCE(bit < 0))
+	if (WARN_ON_ONCE(READ_ONCE(work->bit) < 0))
 		return -EINVAL;
 
-	/* Only need the mask now */
-	bit = BIT(bit);
-
 	guard(irqsave)();
 
 	*cookie = get_cookie(info);
 
-	old = atomic_long_read(&info->unwind_mask);
+	return queue_unwind_work(work, twa_mode, info);
+}
 
-	/* Is this already queued or executed */
-	if (old & bit)
-		return 1;
+static void cancel_unwind_task(struct unwind_work *work)
+{
+	struct task_struct *task;
 
-	/*
-	 * This work's bit hasn't been set yet. Now set it with the PENDING
-	 * bit and fetch the current value of unwind_mask. If ether the
-	 * work's bit or PENDING was already set, then this is already queued
-	 * to have a callback.
-	 */
-	bits = UNWIND_PENDING | bit;
-	old = atomic_long_fetch_or(bits, &info->unwind_mask);
-	if (old & bits) {
+	task = READ_ONCE(work->task);
+
+	if (!task || !task_work_cancel(task, &work->work)) {
 		/*
-		 * If the work's bit was set, whatever set it had better
-		 * have also set pending and queued a callback.
+		 * If the task_work_cancel() fails to cancel it could mean that
+		 * the task_work is just about to execute. This needs to wait
+		 * until the work->func() is finished before returning.
+		 * This is required because the SRCU section may not have been
+		 * entered yet, and the synchronize_srcu() will not wait for it.
 		 */
-		WARN_ON_ONCE(!(old & UNWIND_PENDING));
-		return old & bit;
+		if (task) {
+			rcuwait_wait_event(&work->wait, work->task == NULL,
+					   TASK_UNINTERRUPTIBLE);
+		}
 	}
 
-	/* The work has been claimed, now schedule it. */
-	ret = task_work_add(current, &info->work, twa_mode);
-
-	if (WARN_ON_ONCE(ret))
-		atomic_long_set(&info->unwind_mask, 0);
+	/*
+	 * Needed to protect loop in process_unwind_tasks().
+	 * This also guarantees that unwind_deferred_task_work() is
+	 * completely done and the work structure is no longer referenced.
+	 */
+	synchronize_srcu(&unwind_srcu);
 
-	return ret;
+	/* Still set task to NULL if task_work_cancel() succeeded */
+	work->task = NULL;
 }
 
 void unwind_deferred_cancel(struct unwind_work *work)
@@ -307,16 +460,24 @@ void unwind_deferred_cancel(struct unwind_work *work)
 
 	bit = work->bit;
 
-	/* No work should be using a reserved bit */
-	if (WARN_ON_ONCE(BIT(bit) & RESERVED_BITS))
+	/* Was it initialized? */
+	if (!bit)
 		return;
 
-	guard(mutex)(&callback_mutex);
-	list_del_rcu(&work->list);
+	scoped_guard(mutex, &callback_mutex) {
+		list_del_rcu(&work->list);
+	}
 
 	/* Do not allow any more requests and prevent callbacks */
 	work->bit = -1;
 
+	if (bit == UNWIND_TASK)
+		return cancel_unwind_task(work);
+
+	/* No work should be using a reserved bit */
+	if (WARN_ON_ONCE(BIT(bit) & RESERVED_BITS))
+		return;
+
 	__clear_bit(bit, &unwind_mask);
 
 	synchronize_srcu(&unwind_srcu);
@@ -330,6 +491,17 @@ void unwind_deferred_cancel(struct unwind_work *work)
 	}
 }
 
+/**
+ * unwind_deferred_init - Init unwind_work that traces any task
+ * @work: The unwind_work descriptor to initialize
+ * @func: The callback function that will have the stacktrace
+ *
+ * Initialize a work that can trace any task. There's only a limited
+ * number of these that can be allocated.
+ *
+ * Returns 0 on success or -EBUSY if the limit of these unwind_works has
+ * been exceeded.
+ */
 int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 {
 	memset(work, 0, sizeof(*work));
@@ -348,12 +520,37 @@ int unwind_deferred_init(struct unwind_work *work, unwind_callback_t func)
 	return 0;
 }
 
+/**
+ * unwind_deferred_task_init - Init unwind_work that traces a single task
+ * @work: The unwind_work descriptor to initialize
+ * @func: The callback function that will have the stacktrace
+ *
+ * Initialize a work that will always trace only a single task. It is
+ * up to the caller to make sure that unwind_deferred_request()
+ * will always be called on the same task for the @work descriptor.
+ *
+ * Note, unlike unwind_deferred_init() there is no limit on the number
+ * of these works that can be initialized and used.
+ */
+int unwind_deferred_task_init(struct unwind_work *work, unwind_callback_t func)
+{
+	memset(work, 0, sizeof(*work));
+	work->bit = UNWIND_TASK;
+	init_task_work(&work->work, unwind_deferred_task_work);
+	work->func = func;
+	rcuwait_init(&work->wait);
+
+	guard(mutex)(&callback_mutex);
+	list_add_rcu(&work->list, &task_callbacks);
+	return 0;
+}
+
 void unwind_task_init(struct task_struct *task)
 {
 	struct unwind_task_info *info = &task->unwind_info;
 
 	memset(info, 0, sizeof(*info));
-	init_task_work(&info->work, unwind_deferred_task_work);
+	init_task_work(&info->work, unwind_deferred_work);
 	atomic_long_set(&info->unwind_mask, 0);
 }
 
-- 
2.50.1