[PATCH] taskstats: retain dead thread stats in TGID queries

Yiyang Chen posted 1 patch 6 days, 22 hours ago
fill_stats_for_tgid() builds TGID stats from two sources: the cached
aggregate in signal->stats and a scan of the live threads in the group.

However, fill_tgid_exit() only accumulates delay accounting into
signal->stats. This means TGID queries lose the fields that
fill_stats_for_tgid() adds for live threads once a thread exits,
including ac_etime, ac_utime, ac_stime, nvcsw and nivcsw.

Factor the per-task TGID accumulation into a helper and use it in both
fill_stats_for_tgid() and fill_tgid_exit(). This keeps the fields
retained for dead threads aligned with the fields already accounted for
live threads, and follows the existing taskstats TGID aggregation model,
which already accumulates delay accounting in fill_tgid_exit() and
combines it with a live-thread scan in fill_stats_for_tgid().

Signed-off-by: Yiyang Chen <cyyzero16@gmail.com>
---
 kernel/taskstats.c | 62 ++++++++++++++++++++++++----------------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 0cd680ccc7e5..2a0ab42e4edd 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -210,13 +210,39 @@ static int fill_stats_for_pid(pid_t pid, struct taskstats *stats)
 	return 0;
 }
 
+static void tgid_stats_add_task(struct taskstats *stats,
+				     struct task_struct *tsk, u64 now_ns)
+{
+	u64 delta, utime, stime;
+
+	/*
+	 * Each accounting subsystem calls its functions here to
+	 * accumulate its per-task stats for tsk into the per-tgid structure:
+	 *
+	 *	per-task-foo(stats, tsk);
+	 */
+	delayacct_add_tsk(stats, tsk);
+
+	/* calculate task elapsed time in nsec */
+	delta = now_ns - tsk->start_time;
+	/* Convert to micro seconds */
+	do_div(delta, NSEC_PER_USEC);
+	stats->ac_etime += delta;
+
+	task_cputime(tsk, &utime, &stime);
+	stats->ac_utime += div_u64(utime, NSEC_PER_USEC);
+	stats->ac_stime += div_u64(stime, NSEC_PER_USEC);
+
+	stats->nvcsw += tsk->nvcsw;
+	stats->nivcsw += tsk->nivcsw;
+}
+
 static int fill_stats_for_tgid(pid_t tgid, struct taskstats *stats)
 {
 	struct task_struct *tsk, *first;
 	unsigned long flags;
 	int rc = -ESRCH;
-	u64 delta, utime, stime;
-	u64 start_time;
+	u64 now_ns;
 
 	/*
 	 * Add additional stats from live tasks except zombie thread group
@@ -233,30 +259,12 @@ static int fill_stats_for_tgid(pid_t tgid, struct taskstats *stats)
 	else
 		memset(stats, 0, sizeof(*stats));
 
-	start_time = ktime_get_ns();
+	now_ns = ktime_get_ns();
 	for_each_thread(first, tsk) {
 		if (tsk->exit_state)
 			continue;
-		/*
-		 * Accounting subsystem can call its functions here to
-		 * fill in relevant parts of struct taskstsats as follows
-		 *
-		 *	per-task-foo(stats, tsk);
-		 */
-		delayacct_add_tsk(stats, tsk);
-
-		/* calculate task elapsed time in nsec */
-		delta = start_time - tsk->start_time;
-		/* Convert to micro seconds */
-		do_div(delta, NSEC_PER_USEC);
-		stats->ac_etime += delta;
 
-		task_cputime(tsk, &utime, &stime);
-		stats->ac_utime += div_u64(utime, NSEC_PER_USEC);
-		stats->ac_stime += div_u64(stime, NSEC_PER_USEC);
-
-		stats->nvcsw += tsk->nvcsw;
-		stats->nivcsw += tsk->nivcsw;
+		tgid_stats_add_task(stats, tsk, now_ns);
 	}
 
 	unlock_task_sighand(first, &flags);
@@ -275,18 +283,14 @@ static int fill_stats_for_tgid(pid_t tgid, struct taskstats *stats)
 static void fill_tgid_exit(struct task_struct *tsk)
 {
 	unsigned long flags;
+	u64 now_ns;
 
 	spin_lock_irqsave(&tsk->sighand->siglock, flags);
 	if (!tsk->signal->stats)
 		goto ret;
 
-	/*
-	 * Each accounting subsystem calls its functions here to
-	 * accumalate its per-task stats for tsk, into the per-tgid structure
-	 *
-	 *	per-task-foo(tsk->signal->stats, tsk);
-	 */
-	delayacct_add_tsk(tsk->signal->stats, tsk);
+	now_ns = ktime_get_ns();
+	tgid_stats_add_task(tsk->signal->stats, tsk, now_ns);
 ret:
 	spin_unlock_irqrestore(&tsk->sighand->siglock, flags);
 	return;
-- 
2.43.0
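To make the reported inconsistency concrete, here is a small Python model of the aggregation flow (illustrative only, not kernel code; the field names mirror struct taskstats, the numbers are arbitrary). It shows that when the exit path only folds in delay accounting, the TGID totals silently drop the other fields of an exited thread, while accumulating the same per-task fields at exit keeps them:

```python
# Illustrative model of taskstats TGID aggregation (not kernel code).
# Field names mirror struct taskstats; values are arbitrary.

FIELDS = ("cpu_delay_total", "ac_utime", "nvcsw")

def fill_stats_for_tgid(signal_stats, live_threads):
    """Mimic fill_stats_for_tgid(): cached exit aggregate + live-thread scan."""
    stats = dict(signal_stats)
    for t in live_threads:
        for f in FIELDS:
            stats[f] = stats.get(f, 0) + t[f]
    return stats

def fill_tgid_exit_old(signal_stats, t):
    """Before the patch: only delay accounting is retained at thread exit."""
    signal_stats["cpu_delay_total"] = (
        signal_stats.get("cpu_delay_total", 0) + t["cpu_delay_total"])

def fill_tgid_exit_new(signal_stats, t):
    """After the patch: the same per-task accumulation as the live scan."""
    for f in FIELDS:
        signal_stats[f] = signal_stats.get(f, 0) + t[f]

threads = [{"cpu_delay_total": 5, "ac_utime": 100, "nvcsw": 3},
           {"cpu_delay_total": 7, "ac_utime": 200, "nvcsw": 4}]

# Thread 0 exits; thread 1 stays live.
old, new = {}, {}
fill_tgid_exit_old(old, threads[0])
fill_tgid_exit_new(new, threads[0])

print(fill_stats_for_tgid(old, threads[1:]))  # dead thread's ac_utime/nvcsw lost
print(fill_stats_for_tgid(new, threads[1:]))  # dead thread's fields retained
```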
Re: [PATCH] taskstats: retain dead thread stats in TGID queries
Posted by wang.yaxin@zte.com.cn 3 days, 4 hours ago
>fill_stats_for_tgid() builds TGID stats from two sources: the cached
>aggregate in signal->stats and a scan of the live threads in the group.
>
>However, fill_tgid_exit() only accumulates delay accounting into
>signal->stats. This means TGID queries lose the fields that
>fill_stats_for_tgid() adds for live threads once a thread exits,
>including ac_etime, ac_utime, ac_stime, nvcsw and nivcsw.
>
>Factor the per-task TGID accumulation into a helper and use it in both
>fill_stats_for_tgid() and fill_tgid_exit(). This keeps the fields
>retained for dead threads aligned with the fields already accounted for
>live threads, and follows the existing taskstats TGID aggregation model,
>which already accumulates delay accounting in fill_tgid_exit() and
>combines it with a live-thread scan in fill_stats_for_tgid().

1. Capturing the final delay-accounting update before task exit is necessary;
otherwise, cumulative task delay will be underestimated over time.

2. For fixes of this kind, it is best to include the issue/bug ID that is being
fixed, so the change remains clearly traceable.

Thanks,
Yaxin
Re: [PATCH] taskstats: retain dead thread stats in TGID queries
Posted by Yiyang Chen 2 days, 22 hours ago
Hi Yaxin,

Thanks for the review.

> 1. Capturing the final delay-accounting update before task exit is necessary;
> otherwise, cumulative task delay will be underestimated over time.

Yes, exactly. My patch preserves this behavior — delay accounting was already
being accumulated in fill_tgid_exit() via delayacct_add_tsk(). What I'm doing
is extending this same pattern to other fields (ac_etime, ac_utime, ac_stime,
nvcsw, nivcsw) that were only accumulated for live threads but not for exited
threads, causing data loss in TGID queries.

> 2. For fixes of this kind, it is best to include the issue/bug ID that is being fixed,
> so the change remains clearly traceable.

Thanks. There is no formal issue/bug ID for this problem at the moment.
If needed, I can try to report one and update the patch accordingly.

Thanks,
Yiyang Chen
Re: [PATCH] taskstats: retain dead thread stats in TGID queries
Posted by Dr. Thomas Orgis 4 days, 2 hours ago
On Fri, 27 Mar 2026 03:12:07 +0800,
Yiyang Chen <cyyzero16@gmail.com> wrote:

> However, fill_tgid_exit() only accumulates delay accounting into
> signal->stats. This means TGID queries lose the fields that
> fill_stats_for_tgid() adds for live threads once a thread exits,
> including ac_etime, ac_utime, ac_stime, nvcsw and nivcsw.

Seems like you are properly tackling the problem I as an outsider had
when I started out using the taskstats interface, quoting my
task/process accounting tool sources (from around 2018):

 * I intended to only count processes (tgid stats), but that
 * gives empty values for the ones I am interested in. There was
 * a patch posted ages ago that would have added the accounting
 * fields in the aggregation ... but did not make it, apparently.
 * Linux kernel folks are interested in stuff like that delay
 * accounting (not sure I know what this is about), while I want
 * a reliable way to add up the compute/memory resources used
 * by certain processes.

Thanks!

I was missing the fields interesting to me in tgid stats: (CPU) times,
I/O, memory. I am not sure I will get around to testing your patch quickly
due to personal time and brain-time constraints, but I want to express
interest in it.

I ended up adding the AGROUP flag in taskstats version 12 to solve my
issue, allowing my tool to tell exit of an individual thread from exit
of a group. Though this means I have to get all individual thread stats
from the kernel and later sort them into aggregates per process.
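The per-process sorting described above could be sketched roughly like this (illustrative Python; the record layout is hypothetical, not the binary struct taskstats format delivered over netlink):

```python
# Illustrative userspace aggregation of per-thread exit records into
# per-process (tgid) sums. Record layout is hypothetical, not the
# binary struct taskstats wire format.
from collections import defaultdict

def aggregate_by_tgid(thread_records):
    """Sum selected per-thread fields into one aggregate per tgid."""
    procs = defaultdict(lambda: {"ac_utime": 0, "ac_stime": 0, "ntasks": 0})
    for rec in thread_records:
        agg = procs[rec["tgid"]]
        agg["ac_utime"] += rec["ac_utime"]
        agg["ac_stime"] += rec["ac_stime"]
        agg["ntasks"] += 1          # count of threads seen for this process
    return dict(procs)

records = [
    {"tgid": 100, "ac_utime": 10, "ac_stime": 2},
    {"tgid": 100, "ac_utime": 30, "ac_stime": 4},
    {"tgid": 200, "ac_utime": 5,  "ac_stime": 1},
]
print(aggregate_by_tgid(records))
```

With the dead-thread fields retained kernel-side as in this patch, much of this bookkeeping could instead be read once from the tgid exit record.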

In the end I want to present such to users (percentages on sums for the
whole HPC batch job):

-------------------8<---------------
  cpu   mem     io │ maxrss  maxvm │ tasks  procs │ command
    %     %      % │    GiB    GiB │              │        
═══════════════════╪═══════════════╪══════════════╪════════
100.0  99.9  100.0 │    2.6    3.5 │   576    192 │ some_program

Summary:

    Elapsed time:  38%   (9.1 out of 24.0 h timelimit)                 
             CPU: 100% (191.7 out of 192 physical CPU cores)           
Max. main memory:  37% (273.9 out of 750.0 GiB min. available per node)
----------------->8-----------------

I can discern that this was a structurally simple (MPI) program that
spawned one process per CPU core and probably had two extra threads per
core for communication. It allocated 34 % more memory than it actually
needed. This one program took so much of the job's resources that other
processes don't really count. A bad HPC job has a long table of
commands each contributing a little, down towards individual calls to
'cat' and the like. I want to see and present those cases.

In another application, I collect statistics using accumulated CPU time
and coremem per program binary to be able to tell which programs and
(older) versions use how much of our cluster over the years.

With a counter for total tasks over the group lifetime added to struct
taskstats and the missing fields filled following your patch, I could
get all this information with a lot less overhead via datasets only on
tgid exit and would not have to count each task as it finishes. I
always like less overhead for monitoring/accounting!

> Factor the per-task TGID accumulation into a helper and use it in both
> fill_stats_for_tgid() and fill_tgid_exit(). This keeps the fields
> retained for dead threads aligned with the fields already accounted for
> live threads, and follows the existing taskstats TGID aggregation model,
> which already accumulates delay accounting in fill_tgid_exit() and
> combines it with a live-thread scan in fill_stats_for_tgid().

Pardon my ignorance, as I do not have the time right now to dive back
into kernel code: Should other fields of interest also be filled? Do we
have all of them covered? Memory highwater marks are not per-task,
right? But coremem, virtmem? I/O stats?

Also, in the end, I'd strongly prefer this patch to include a
user-visible change in the API, like an increased TASKSTATS_VERSION.
There are no new fields added, but the interpretation of the data is
different now for tgid.


Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg
Re: [PATCH] taskstats: retain dead thread stats in TGID queries
Posted by Yiyang Chen 2 days, 23 hours ago
Hi Dr. Orgis,

> I can discern that this was a structurally simple (MPI) program that
> spawned one process per CPU core and probably had two extra threads per
> core for communication. It allocated 34 % more memory than it actually
> needed. This one program took so much of the job's resources that other
> processes don't really count. A bad HPC job has a long table of
> commands each contributing a little, down towards individual calls to
> 'cat' and the like. I want to see and present those cases.
>
> In another application, I collect statistics using accumulated CPU time
> and coremem per program binary to be able to tell which programs and
> (older) versions use how much of our cluster over the years.
>
> With a counter for total tasks over the group lifetime added to struct
> taskstats and the missing fields filled following your patch, I could
> get all this information with a lot less overhead via datasets only on
> tgid exit and would not have to count each task as it finishes. I
> always like less overhead for monitoring/accounting!

Thanks a lot for the detailed feedback and for sharing your use case!

> > Factor the per-task TGID accumulation into a helper and use it in both
> > fill_stats_for_tgid() and fill_tgid_exit(). This keeps the fields
> > retained for dead threads aligned with the fields already accounted for
> > live threads, and follows the existing taskstats TGID aggregation model,
> > which already accumulates delay accounting in fill_tgid_exit() and
> > combines it with a live-thread scan in fill_stats_for_tgid().
>
> Pardon my ignorance, as I do not have the time right now to dive back
> into kernel code: Should other fields of interest also be filled? Do we
> have all of them covered? Memory highwater marks are not per-task,
> right? But coremem, virtmem? I/O stats?

You're right that my current patch only covers
ac_etime/ac_utime/ac_stime/nvcsw/nivcsw and delay accounting.
I focused on the fields that fill_stats_for_tgid() already
accumulated for live threads, to fix the inconsistency where
dead threads lost accumulation in TGID queries. This also unifies
the fields reported by TGID queries and exit notifications, and
ensures that dead threads are counted correctly.
Adding the other fields makes sense as a follow-up patch; that
may require some minor refactoring to reuse parts of the PID
taskstats accounting code.

> Also, in the end, I'd strongly prefer this patch to include a
> user-visible change in the API, like an increased TASKSTATS_VERSION.
> There are no new fields added, but the interpretation of the data is
> different now for tgid.

My current thinking is not to bump TASKSTATS_VERSION,
since the struct layout and fields are unchanged.
But if maintainers think the semantic change should be versioned,
I’m happy to do that.

Thanks,
Yiyang Chen