[PATCH v2] exit: move and extend sched_process_exit() tracepoint

Andrii Nakryiko posted 1 patch 1 day ago
include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
kernel/exit.c                |  2 +-
2 files changed, 26 insertions(+), 4 deletions(-)
[PATCH v2] exit: move and extend sched_process_exit() tracepoint
Posted by Andrii Nakryiko 1 day ago
It is useful to be able to access current->mm at task exit to, say,
record a bunch of VMA information right before the task exits (e.g., for
stack symbolization reasons when dealing with short-lived processes that
exit in the middle of profiling session). Currently,
trace_sched_process_exit() is triggered after exit_mm() which resets
current->mm to NULL making this tracepoint unsuitable for inspecting
and recording task's mm_struct-related data when tracing process
lifetimes.

There is a particularly suitable place, though, right after
taskstats_exit() is called, but before we do exit_mm() and other
exit_*() resource teardowns. taskstats performs a similar kind of
accounting that some applications do with BPF, and so co-locating them
seems like a good fit. So that's where trace_sched_process_exit() is
moved with this patch.

Also, existing trace_sched_process_exit() tracepoint is notoriously
missing `group_dead` flag that is certainly useful in practice and some
of our production applications have to work around this. So plumb
`group_dead` through while at it, to have a richer and more complete
tracepoint.

Note that we can't use sched_process_template anymore, and so we use
TRACE_EVENT()-based tracepoint definition. But all the field names and
order, as well as assign and output logic remain intact. We just add one
extra field at the end in backwards-compatible way.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
 kernel/exit.c                |  2 +-
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 8994e97d86c1..05a14f2b35c3 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -328,9 +328,31 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
 /*
  * Tracepoint for a task exiting:
  */
-DEFINE_EVENT(sched_process_template, sched_process_exit,
-	     TP_PROTO(struct task_struct *p),
-	     TP_ARGS(p));
+TRACE_EVENT(sched_process_exit,
+
+	TP_PROTO(struct task_struct *p, bool group_dead),
+
+	TP_ARGS(p, group_dead),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	int,	prio			)
+		__field(	bool,	group_dead		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
+		__entry->pid		= p->pid;
+		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
+		__entry->group_dead	= group_dead;
+	),
+
+	TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
+		  __entry->comm, __entry->pid, __entry->prio,
+		  __entry->group_dead ? "true" : "false"
+	)
+);
 
 /*
  * Tracepoint for waiting on task to unschedule:
diff --git a/kernel/exit.c b/kernel/exit.c
index c2e6c7b7779f..4abd307b1586 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -937,12 +937,12 @@ void __noreturn do_exit(long code)
 
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
+	trace_sched_process_exit(tsk, group_dead);
 
 	exit_mm();
 
 	if (group_dead)
 		acct_process();
-	trace_sched_process_exit(tsk);
 
 	exit_sem(tsk);
 	exit_shm(tsk);
-- 
2.47.1
Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
Posted by Ingo Molnar 4 hours ago
* Andrii Nakryiko <andrii@kernel.org> wrote:

> It is useful to be able to access current->mm at task exit to, say,
> record a bunch of VMA information right before the task exits (e.g., for
> stack symbolization reasons when dealing with short-lived processes that
> exit in the middle of profiling session). Currently,
> trace_sched_process_exit() is triggered after exit_mm() which resets
> current->mm to NULL making this tracepoint unsuitable for inspecting
> and recording task's mm_struct-related data when tracing process
> lifetimes.
> 
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm() and other
> exit_*() resource teardowns. taskstats performs a similar kind of
> accounting that some applications do with BPF, and so co-locating them
> seems like a good fit. So that's where trace_sched_process_exit() is
> moved with this patch.
> 
> Also, existing trace_sched_process_exit() tracepoint is notoriously
> missing `group_dead` flag that is certainly useful in practice and some
> of our production applications have to work around this. So plumb
> `group_dead` through while at it, to have a richer and more complete
> tracepoint.
> 
> Note that we can't use sched_process_template anymore, and so we use
> TRACE_EVENT()-based tracepoint definition.

 But all the field names and
> order, as well as assign and output logic remain intact. We just add one
> extra field at the end in backwards-compatible way.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
>  kernel/exit.c                |  2 +-
>  2 files changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 8994e97d86c1..05a14f2b35c3 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -328,9 +328,31 @@ DEFINE_EVENT(sched_process_template, sched_process_free,
>  /*
>   * Tracepoint for a task exiting:
>   */
> -DEFINE_EVENT(sched_process_template, sched_process_exit,
> -	     TP_PROTO(struct task_struct *p),
> -	     TP_ARGS(p));
> +TRACE_EVENT(sched_process_exit,
> +
> +	TP_PROTO(struct task_struct *p, bool group_dead),
> +
> +	TP_ARGS(p, group_dead),
> +
> +	TP_STRUCT__entry(
> +		__array(	char,	comm,	TASK_COMM_LEN	)
> +		__field(	pid_t,	pid			)
> +		__field(	int,	prio			)
> +		__field(	bool,	group_dead		)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
> +		__entry->pid		= p->pid;
> +		__entry->prio		= p->prio; /* XXX SCHED_DEADLINE */
> +		__entry->group_dead	= group_dead;
> +	),
> +
> +	TP_printk("comm=%s pid=%d prio=%d group_dead=%s",
> +		  __entry->comm, __entry->pid, __entry->prio,
> +		  __entry->group_dead ? "true" : "false"
> +	)

This feels really fragile, could you please at least add a comment that 
points out that this is basically an extension of 
sched_process_template, and that it should remain a subset of it, or 
something to that end?

Thanks,

	Ingo
Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
Posted by Steven Rostedt 3 hours ago
On Thu, 3 Apr 2025 15:54:22 +0200
Ingo Molnar <mingo@kernel.org> wrote:

> This feels really fragile, could you please at least add a comment that 
> points out that this is basically an extension of 
> sched_process_template, and that it should remain a subset of it, or 
> something to that end?

Is there any dependency on this?

The reason to use the templates is because it saves memory. Each
TRACE_EVENT() can add ~5k (which a TRACE_EVENT() is really just a
DECLARE_EVENT_CLASS() + DEFINE_EVENT() for a single event).

Each DEFINE_EVENT() just adds around 250 bytes. Hence, if you have multiple
events that share the same fields and output, it's much more memory
efficient to use the CLASS and EVENT logic then making each their own
TRACE_EVENT().

I don't know of any other dependency to why this was a template other than
to save memory.

-- Steve
Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
Posted by Ingo Molnar 3 hours ago
* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 3 Apr 2025 15:54:22 +0200
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > This feels really fragile, could you please at least add a comment 
> > that points out that this is basically an extension of 
> > sched_process_template, and that it should remain a subset of it, 
> > or something to that end?
> 
> Is there any dependency on this?
> 
> I don't know of any other dependency to why this was a template other than
> to save memory.

Uhm, to state the obvious: to not replicate the same definitions over 
and over again three times times, for 3 scheduler tracepoints that 
share the record format?

Removing just a single sched_process_template use bloats the source and 
adds in potential fragility:

 2 files changed, 26 insertions(+), 4 deletions(-)

So my request is to please at least add a comment that points the 
reader to the shared record format between sched_process_exit and the 
other two tracepoints.

Thanks,

	Ingo
Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
Posted by Andrii Nakryiko 2 hours ago
On Thu, Apr 3, 2025 at 8:12 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > On Thu, 3 Apr 2025 15:54:22 +0200
> > Ingo Molnar <mingo@kernel.org> wrote:
> >
> > > This feels really fragile, could you please at least add a comment
> > > that points out that this is basically an extension of
> > > sched_process_template, and that it should remain a subset of it,
> > > or something to that end?
> >
> > Is there any dependency on this?
> >
> > I don't know of any other dependency to why this was a template other than
> > to save memory.
>
> Uhm, to state the obvious: to not replicate the same definitions over
> and over again three times times, for 3 scheduler tracepoints that
> share the record format?
>
> Removing just a single sched_process_template use bloats the source and
> adds in potential fragility:
>
>  2 files changed, 26 insertions(+), 4 deletions(-)
>
> So my request is to please at least add a comment that points the
> reader to the shared record format between sched_process_exit and the
> other two tracepoints.

Sounds good, no problem. I'll send a follow up patch which Andrew can
fold, if he prefers.

>
> Thanks,
>
>         Ingo
Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
Posted by Oleg Nesterov 22 hours ago
On 04/02, Andrii Nakryiko wrote:
>
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -937,12 +937,12 @@ void __noreturn do_exit(long code)
>  
>  	tsk->exit_code = code;
>  	taskstats_exit(tsk, group_dead);
> +	trace_sched_process_exit(tsk, group_dead);
>  
>  	exit_mm();
>  
>  	if (group_dead)
>  		acct_process();
> -	trace_sched_process_exit(tsk);

I see nothing wrong in this change.

(and I think that the current placement of trace_sched_process_exit() was
 more or less "random").

Acked-by: Oleg Nesterov <oleg@redhat.com>
Re: [PATCH v2] exit: move and extend sched_process_exit() tracepoint
Posted by Steven Rostedt 23 hours ago
On Wed,  2 Apr 2025 11:09:25 -0700
Andrii Nakryiko <andrii@kernel.org> wrote:

> It is useful to be able to access current->mm at task exit to, say,
> record a bunch of VMA information right before the task exits (e.g., for
> stack symbolization reasons when dealing with short-lived processes that
> exit in the middle of profiling session). Currently,
> trace_sched_process_exit() is triggered after exit_mm() which resets
> current->mm to NULL making this tracepoint unsuitable for inspecting
> and recording task's mm_struct-related data when tracing process
> lifetimes.
> 
> There is a particularly suitable place, though, right after
> taskstats_exit() is called, but before we do exit_mm() and other
> exit_*() resource teardowns. taskstats performs a similar kind of
> accounting that some applications do with BPF, and so co-locating them
> seems like a good fit. So that's where trace_sched_process_exit() is
> moved with this patch.
> 
> Also, existing trace_sched_process_exit() tracepoint is notoriously
> missing `group_dead` flag that is certainly useful in practice and some
> of our production applications have to work around this. So plumb
> `group_dead` through while at it, to have a richer and more complete
> tracepoint.
> 
> Note that we can't use sched_process_template anymore, and so we use
> TRACE_EVENT()-based tracepoint definition. But all the field names and
> order, as well as assign and output logic remain intact. We just add one
> extra field at the end in backwards-compatible way.
> 
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  include/trace/events/sched.h | 28 +++++++++++++++++++++++++---
>  kernel/exit.c                |  2 +-
>  2 files changed, 26 insertions(+), 4 deletions(-)

Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve