From: Chen Yu
To: Peter Zijlstra, Vincent Guittot, Ingo Molnar, Juri Lelli
Cc: Mel Gorman, Tim Chen, Dietmar Eggemann, Steven Rostedt, Ben Segall, K Prateek Nayak, Abel Wu, Yicong Yang, Gautham R. Shenoy, Honglei Wang, Len Brown, Chen Yu, Tianchen Ding, Joel Fernandes, Josh Don, Hillf Danton, kernel test robot, Arjan Van De Ven, Aaron Lu, linux-kernel@vger.kernel.org
Subject: [PATCH v7 1/2] sched/fair: Record the average duration of a task
Date: Sat, 22 Apr 2023 00:07:59 +0800
Message-Id: <5bf131940d2fa1d25bddc745892ccd407beebcd6.1682060436.git.yu.c.chen@intel.com>

Record the average duration of a task, as there is a requirement to
leverage this information for better task placement.

At first thought, (p->se.sum_exec_runtime / p->nvcsw) could be used to
measure the task duration. However, that formula weights the distant
past too heavily; ideally, old activity should decay and not affect the
current status too much. Although something based on PELT could be used,
se.util_avg is not appropriate for describing the task duration: if
task p1 and task p2 do frequent ping-pong scheduling on one CPU, both
p1 and p2 have a short duration, but the util_avg of each can reach
50%, which is inconsistent with the actual task duration.

There was once a similar feature to track the duration of a task:

  commit ad4b78bbcbab ("sched: Add new wakeup preemption mode: WAKEUP_RUNNING")

Unfortunately, it was reverted because it was an experiment. Pick the
approach up again, by recording the average duration when a task
voluntarily switches out.
For example, suppose on CPU1, task p1 and p2 run alternately:

 --------------------> time

 | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
 ^             ^               ^                                     ^
 |_____________|               |_____________________________________|
                                                                     ^
                                                                     |
                                                               p1 dequeued

p1's duration in one section is (1 + 0.5) ms: had p2 not preempted p1,
p1 could have run for 1.5 ms. This reflects the nature of a task: how
long it wishes to run at most.

Suggested-by: Tim Chen
Suggested-by: Vincent Guittot
Tested-by: K Prateek Nayak
Signed-off-by: Chen Yu
---
 include/linux/sched.h | 3 +++
 kernel/sched/core.c   | 2 ++
 kernel/sched/debug.c  | 1 +
 kernel/sched/fair.c   | 13 +++++++++++++
 4 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d654eb4cabd..f94e6aa159b0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -558,6 +558,9 @@ struct sched_entity {
 	u64				prev_sum_exec_runtime;
 
 	u64				nr_migrations;
+	u64				prev_sleep_sum_runtime;
+	/* average duration of a task */
+	u64				dur_avg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d861db8aa7ab..59a6c0414a19 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4446,6 +4446,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.dur_avg			= 0;
+	p->se.prev_sleep_sum_runtime	= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	__PS("nr_involuntary_switches", p->nivcsw);
 
 	P(se.load.weight);
+	P(se.dur_avg);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
 	P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f5da01a6b35a..4af5799b90fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6283,6 +6283,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+	u64 dur;
+
+	if (!task_sleep)
+		return;
+
+	dur = p->se.sum_exec_runtime - p->se.prev_sleep_sum_runtime;
+	p->se.prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->se.dur_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6355,6 +6367,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	dur_avg_update(p, task_sleep);
 	hrtick_update(rq);
 }
 
-- 
2.25.1