[PATCH v7 09/23] sched: Fix runtime accounting w/ split exec & sched contexts

John Stultz posted 23 patches 1 year, 12 months ago
[PATCH v7 09/23] sched: Fix runtime accounting w/ split exec & sched contexts
Posted by John Stultz 1 year, 12 months ago
The idea here is we want to charge the scheduler-context task's
vruntime but charge the execution-context task's sum_exec_runtime.

This way cputime accounting goes against the task actually running
but vruntime accounting goes against the selected task so we get
proper fairness.

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/fair.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07216ea3ed53..085941db5bf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1129,22 +1129,35 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *se)
 {
 	u64 now = rq_clock_task(rq);
 	s64 delta_exec;
 
-	delta_exec = now - curr->exec_start;
+	/* Calculate the delta from selected se */
+	delta_exec = now - se->exec_start;
 	if (unlikely(delta_exec <= 0))
 		return delta_exec;
 
-	curr->exec_start = now;
-	curr->sum_exec_runtime += delta_exec;
+	/* Update selected se's exec_start */
+	se->exec_start = now;
+	if (entity_is_task(se)) {
+		struct task_struct *running = rq->curr;
+		/*
+		 * If se is a task, we account the time against the running
+		 * task, as w/ proxy-exec they may not be the same.
+		 */
+		running->se.exec_start = now;
+		running->se.sum_exec_runtime += delta_exec;
+	} else {
+		/* If not task, account the time against se */
+		se->sum_exec_runtime += delta_exec;
+	}
 
 	if (schedstat_enabled()) {
 		struct sched_statistics *stats;
 
-		stats = __schedstats_from_se(curr);
+		stats = __schedstats_from_se(se);
 		__schedstat_set(stats->exec_max,
 				max(delta_exec, stats->exec_max));
 	}
-- 
2.43.0.472.g3155946c3a-goog
Re: [PATCH v7 09/23] sched: Fix runtime accounting w/ split exec & sched contexts
Posted by Valentin Schneider 1 year, 11 months ago
(I did a reply instead of a reply-all, sorry John you're getting this one twice!)

On 19/12/23 16:18, John Stultz wrote:
> The idea here is we want to charge the scheduler-context task's
> vruntime but charge the execution-context task's sum_exec_runtime.
>
> This way cputime accounting goes against the task actually running
> but vruntime accounting goes against the selected task so we get
> proper fairness.

This looks like the right approach, especially when it comes to exposing
data to userspace as with e.g. top.

I did however get curious as to what would be the impact of not updating
the donor's sum_exec_runtime. A quick look through fair.c shows these
function using it:
  - numa_get_avg_runtime()
  - task_numa_work()
  - task_tick_numa()
  - set_next_entity()
  - hrtick_start_fair()

The NUMA ones shouldn't matter too much, as they care about the actually
running task, which is the one that gets its sum_exec_runtime increased.
task_tick_numa() needs to be changed though, as it should be passed the
currently running task, not the selected (donor) one, but shouldn't need
any other change (famous last words).

Generally I think all of the NUMA balancing stats stuff shouldn't care
about the donor task, as the pages being accessed are part of the execution
context.

The hrtick one is tricky. AFAICT since we don't update the donor's
sum_exec_runtime, in proxy scenarios we'll end up always programming the
hrtimer to the entire extent of the donor's slice, which might not be
correct. Considering the HRTICK SCHED_FEAT defaults to disabled, that could
be left as a TODO.