[tip: sched/core] sched/tracepoints: Move and extend the sched_process_exit() tracepoint

tip-bot2 for Andrii Nakryiko posted 1 patch 10 months, 1 week ago
include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++----
kernel/exit.c                |  2 +-
2 files changed, 31 insertions(+), 5 deletions(-)
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     3e816361e94a0e79b1aabf44abec552e9698b196
Gitweb:        https://git.kernel.org/tip/3e816361e94a0e79b1aabf44abec552e9698b196
Author:        Andrii Nakryiko <andrii@kernel.org>
AuthorDate:    Wed, 02 Apr 2025 11:09:25 -07:00
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Fri, 04 Apr 2025 10:30:19 +02:00

sched/tracepoints: Move and extend the sched_process_exit() tracepoint

It is useful to be able to access current->mm at task exit to, say,
record a bunch of VMA information right before the task exits (e.g., for
stack symbolization when dealing with short-lived processes that exit
in the middle of a profiling session). Currently,
trace_sched_process_exit() is triggered after exit_mm(), which resets
current->mm to NULL, making this tracepoint unsuitable for inspecting
and recording the task's mm_struct-related data when tracing process
lifetimes.
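
As an illustration only (not part of this patch), the ordering problem
can be modeled in user space. The sketch below uses made-up stand-ins
(struct task_model, exit_hook(), exit_mm_model(), do_exit_model()) for
the task, the tracepoint, and exit_mm(); none of these names exist in
the kernel. It only shows that a hook running after the mm teardown
observes a NULL mm, while a hook running before it does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for mm_struct and task_struct; not the kernel types. */
struct mm_model {
	int nr_vmas;
};

struct task_model {
	struct mm_model *mm;
};

/* Models the tracepoint: reports whether the task's mm is still visible. */
static bool exit_hook(const struct task_model *t)
{
	assert(t != NULL);
	return t->mm != NULL;
}

/* Models the one effect of exit_mm() that matters here: mm becomes NULL. */
static void exit_mm_model(struct task_model *t)
{
	t->mm = NULL;
}

/*
 * Models the relevant slice of the exit path.
 * hook_before_exit_mm == true  -> placement chosen by this patch
 * hook_before_exit_mm == false -> previous placement
 * Returns what the hook observed.
 */
static bool do_exit_model(bool hook_before_exit_mm)
{
	struct mm_model mm = { .nr_vmas = 3 };
	struct task_model tsk = { .mm = &mm };
	bool saw_mm = false;

	if (hook_before_exit_mm)
		saw_mm = exit_hook(&tsk);

	exit_mm_model(&tsk);

	if (!hook_before_exit_mm)
		saw_mm = exit_hook(&tsk);

	return saw_mm;
}
```

In this model, do_exit_model(false) returns false (the hook sees no mm)
and do_exit_model(true) returns true.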

There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm() and other
exit_*() resource teardowns. taskstats performs a kind of accounting
similar to what some applications do with BPF, so co-locating them
seems like a good fit. So that's where trace_sched_process_exit() is
moved with this patch.

Also, the existing trace_sched_process_exit() tracepoint is notoriously
missing the `group_dead` flag, which is certainly useful in practice;
some of our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.
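
For context, and as an assumption from reading do_exit() rather than
something this patch changes: group_dead is computed there as
atomic_dec_and_test(&tsk->signal->live), so it is true only for the
last thread of a thread group to exit. A minimal user-space model of
that rule follows, with hypothetical names (struct signal_model,
thread_exit_model()) and C11 atomics standing in for the kernel's
atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the per-thread-group live counter; not signal_struct. */
struct signal_model {
	atomic_int live;	/* number of threads that have not exited yet */
};

/*
 * Models one thread exiting. Mirrors the semantics of a
 * decrement-and-test: returns true only when this call drops the
 * counter to zero, i.e. the caller is the last live thread.
 */
static bool thread_exit_model(struct signal_model *sig)
{
	assert(sig != NULL);
	/* atomic_fetch_sub() returns the value held before the subtraction. */
	return atomic_fetch_sub(&sig->live, 1) == 1;
}
```

With live initialized to 3, the first two calls return false and the
third returns true, which is the distinction between "a thread exited"
and "the whole process exited" that a tracepoint consumer would
otherwise have to reconstruct on its own.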

Note that we can't use sched_process_template anymore, and so we use
a TRACE_EVENT()-based tracepoint definition. But all the field names
and their order, as well as the assign and output logic, remain intact.
We just add one extra field at the end in a backwards-compatible way.
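
To make the compatibility argument concrete, here is a sketch (not
kernel code) that mirrors only the event-specific fields from the old
and new definitions as plain C structs and checks that the pre-existing
fields keep their offsets. The struct and function names are
hypothetical, and the common trace entry header that precedes these
fields in a real record is deliberately left out since it is identical
in both cases.

```c
#include <assert.h>	/* static_assert */
#include <stdbool.h>
#include <stddef.h>	/* offsetof */
#include <sys/types.h>	/* pid_t */

#define TASK_COMM_LEN 16	/* same value the kernel uses */

/* Hypothetical mirror of the old sched_process_template payload. */
struct old_exit_entry {
	char	comm[TASK_COMM_LEN];
	pid_t	pid;
	int	prio;
};

/* Hypothetical mirror of the new payload: one field appended at the end. */
struct new_exit_entry {
	char	comm[TASK_COMM_LEN];
	pid_t	pid;
	int	prio;
	bool	group_dead;
};

/* Compile-time checks: appending a field must not move the old ones. */
static_assert(offsetof(struct old_exit_entry, comm) ==
	      offsetof(struct new_exit_entry, comm), "comm moved");
static_assert(offsetof(struct old_exit_entry, pid) ==
	      offsetof(struct new_exit_entry, pid), "pid moved");
static_assert(offsetof(struct old_exit_entry, prio) ==
	      offsetof(struct new_exit_entry, prio), "prio moved");

/* Run-time form of the same check, convenient for a unit test. */
static int old_fields_unmoved(void)
{
	return offsetof(struct old_exit_entry, comm) ==
	       offsetof(struct new_exit_entry, comm) &&
	       offsetof(struct old_exit_entry, pid) ==
	       offsetof(struct new_exit_entry, pid) &&
	       offsetof(struct old_exit_entry, prio) ==
	       offsetof(struct new_exit_entry, prio);
}
```

A consumer that reads comm, pid, and prio at fixed offsets would keep
working under this model; one that also checks the total record size
would still need to be updated.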

Document the dependency on sched_process_template anyway.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250402180925.90914-1-andrii@kernel.org
---
 include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++----
 kernel/exit.c                |  2 +-
 2 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97..3bec9fb 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -326,11 +326,37 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
 	     TP_ARGS(p));
 
 /*
- * Tracepoint for a task exiting:
+ * Tracepoint for a task exiting.
+ * Note, it's a superset of sched_process_template and should be kept
+ * compatible as much as possible. sched_process_exits has an extra
+ * `group_dead` argument, so sched_process_template can't be used,
+ * unfortunately, just like sched_migrate_task above.
  */
-DEFINE_EVENT(sched_process_template, sched_process_exit,
-	     TP_PROTO(struct task_struct *p),
-	     TP_ARGS(p));
+TRACE_EVENT(sched_process_exit,
+
+	TP_PROTO(struct task_struct *p, bool group_dead),
+
+	TP_ARGS(p, group_dead),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	int,	prio			)
+		__field(	bool,	group_dead		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
+		__entry->group_dead	= group_dead;
+	),
+
+	TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
+		  __entry->comm, __entry->pid, __entry->prio,
+		  __entry->group_dead ? "true" : "false"
+	)
+);
 
 /*
  * Tracepoint for waiting on task to unschedule:
diff --git a/kernel/exit.c b/kernel/exit.c
index 1b51dc0..f1db86d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -936,12 +936,12 @@ void __noreturn do_exit(long code)
 
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
+	trace_sched_process_exit(tsk, group_dead);
 
 	exit_mm();
 
 	if (group_dead)
 		acct_process();
-	trace_sched_process_exit(tsk);
 
 	exit_sem(tsk);
 	exit_shm(tsk);
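
Not part of the patch: a minimal sketch of how a consumer of the text
trace output might parse the payload produced by the TP_printk() format
above. struct exit_event and parse_exit_event() are hypothetical names.
The sketch assumes the caller has already stripped the common trace
prefix (task, CPU, timestamp, event name) and that comm contains no
whitespace, which is not guaranteed in general.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space view of one sched_process_exit text record. */
struct exit_event {
	char	comm[16];	/* TASK_COMM_LEN bytes, NUL included */
	int	pid;
	int	prio;
	bool	group_dead;
};

/*
 * Parses "comm=%s pid=%d prio=%d group_dead=%s" as printed by the new
 * tracepoint. Returns 0 on success, -1 if the line does not match
 * (for example, a line in the old format without group_dead).
 */
static int parse_exit_event(const char *s, struct exit_event *ev)
{
	char flag[8];

	assert(s != NULL && ev != NULL);

	/* Field widths keep sscanf() from overflowing comm and flag. */
	if (sscanf(s, "comm=%15s pid=%d prio=%d group_dead=%7s",
		   ev->comm, &ev->pid, &ev->prio, flag) != 4)
		return -1;

	if (strcmp(flag, "true") == 0)
		ev->group_dead = true;
	else if (strcmp(flag, "false") == 0)
		ev->group_dead = false;
	else
		return -1;

	return 0;
}
```

Rejecting lines without group_dead is a design choice for this sketch;
a tool that must handle traces from both older and newer kernels would
more likely treat the field as optional.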
Re: [tip: sched/core] sched/tracepoints: Move and extend the sched_process_exit() tracepoint
Posted by Andrii Nakryiko 10 months, 1 week ago
On Fri, Apr 4, 2025 at 1:37 AM tip-bot2 for Andrii Nakryiko
<tip-bot2@linutronix.de> wrote:
>
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID:     3e816361e94a0e79b1aabf44abec552e9698b196
> Gitweb:        https://git.kernel.org/tip/3e816361e94a0e79b1aabf44abec552e9698b196
> Author:        Andrii Nakryiko <andrii@kernel.org>
> AuthorDate:    Wed, 02 Apr 2025 11:09:25 -07:00
> Committer:     Ingo Molnar <mingo@kernel.org>
> CommitterDate: Fri, 04 Apr 2025 10:30:19 +02:00
>
> sched/tracepoints: Move and extend the sched_process_exit() tracepoint
>
> It is useful to be able to access current->mm at task exit to, say,
> record a bunch of VMA information right before the task exits (e.g., for
> stack symbolization when dealing with short-lived processes that exit
> in the middle of a profiling session). Currently,
> trace_sched_process_exit() is triggered after exit_mm(), which resets
> current->mm to NULL, making this tracepoint unsuitable for inspecting
> and recording the task's mm_struct-related data when tracing process
> lifetimes.
>
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm() and other
> exit_*() resource teardowns. taskstats performs a kind of accounting
> similar to what some applications do with BPF, so co-locating them
> seems like a good fit. So that's where trace_sched_process_exit() is
> moved with this patch.
>
> Also, the existing trace_sched_process_exit() tracepoint is notoriously
> missing the `group_dead` flag, which is certainly useful in practice;
> some of our production applications have to work around this. So plumb
> `group_dead` through while at it, to have a richer and more complete
> tracepoint.
>
> Note that we can't use sched_process_template anymore, and so we use
> a TRACE_EVENT()-based tracepoint definition. But all the field names
> and their order, as well as the assign and output logic, remain intact.
> We just add one extra field at the end in a backwards-compatible way.
>
> Document the dependency on sched_process_template anyway.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

Adding Andrew.

Seems like my patch was applied by both Andrew ([0], [1]) and Ingo.
Andrew, would it be possible to drop those from your tree and keep the
one in Ingo's tip/sched/core? Thanks!

  [0] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/exit-move-and-extend-sched_process_exit-tracepoint.patch
  [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/exit-move-and-extend-sched_process_exit-tracepoint-fix.patch

> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> Acked-by: Oleg Nesterov <oleg@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lore.kernel.org/r/20250402180925.90914-1-andrii@kernel.org
> ---
>  include/trace/events/sched.h | 34 ++++++++++++++++++++++++++++++----
>  kernel/exit.c                |  2 +-
>  2 files changed, 31 insertions(+), 5 deletions(-)
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97..3bec9fb 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -326,11 +326,37 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
>              TP_ARGS(p));
>
>  /*
> - * Tracepoint for a task exiting:
> + * Tracepoint for a task exiting.
> + * Note, it's a superset of sched_process_template and should be kept
> + * compatible as much as possible. sched_process_exits has an extra
> + * `group_dead` argument, so sched_process_template can't be used,
> + * unfortunately, just like sched_migrate_task above.
>   */
> -DEFINE_EVENT(sched_process_template, sched_process_exit,
> -            TP_PROTO(struct task_struct *p),
> -            TP_ARGS(p));
> +TRACE_EVENT(sched_process_exit,
> +
> +       TP_PROTO(struct task_struct *p, bool group_dead),
> +
> +       TP_ARGS(p, group_dead),
> +
> +       TP_STRUCT__entry(
> +               __array(        char,   comm,   TASK_COMM_LEN   )
> +               __field(        pid_t,  pid                     )
> +               __field(        int,    prio                    )
> +               __field(        bool,   group_dead              )
> +       ),
> +
> +       TP_fast_assign(
> +               memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> +               __entry->pid            = p->pid;
> +               __entry->prio           = p->prio; /* XXX SCHED_DEADLINE */
> +               __entry->group_dead     = group_dead;
> +       ),
> +
> +       TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
> +                 __entry->comm, __entry->pid, __entry->prio,
> +                 __entry->group_dead ? "true" : "false"
> +       )
> +);
>
>  /*
>   * Tracepoint for waiting on task to unschedule:
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 1b51dc0..f1db86d 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -936,12 +936,12 @@ void __noreturn do_exit(long code)
>
>         tsk->exit_code = code;
>         taskstats_exit(tsk, group_dead);
> +       trace_sched_process_exit(tsk, group_dead);
>
>         exit_mm();
>
>         if (group_dead)
>                 acct_process();
> -       trace_sched_process_exit(tsk);
>
>         exit_sem(tsk);
>         exit_shm(tsk);