From nobody Tue Jun 30 19:06:00 2026
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3EE09C433F5 for ; Tue, 11 Jan 2022 09:56:16 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1347438AbiAKJ4P (ORCPT ); Tue, 11 Jan 2022 04:56:15 -0500
Received: from out30-43.freemail.mail.aliyun.com ([115.124.30.43]:42642 "EHLO out30-43.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237188AbiAKJ4M (ORCPT ); Tue, 11 Jan 2022 04:56:12 -0500
X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R181e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04407;MF=cruzzhao@linux.alibaba.com;NM=1;PH=DS;RN=15;SR=0;TI=SMTPD_---0V1ZVy34_1641894961;
Received: from AliYun.localdomain(mailfrom:CruzZhao@linux.alibaba.com fp:SMTPD_---0V1ZVy34_1641894961) by smtp.aliyun-inc.com(127.0.0.1); Tue, 11 Jan 2022 17:56:08 +0800
From: Cruz Zhao
To: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, joshdon@google.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 1/3] sched/core: Accounting forceidle time for all tasks except idle task
Date: Tue, 11 Jan 2022 17:55:59 +0800
Message-Id: <1641894961-9241-2-git-send-email-CruzZhao@linux.alibaba.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1641894961-9241-1-git-send-email-CruzZhao@linux.alibaba.com>
References: <1641894961-9241-1-git-send-email-CruzZhao@linux.alibaba.com>
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

There are two types of
forced idle time: forced idle time from cookie'd tasks and forced idle time from uncookie'd tasks. The forced idle time from uncookie'd tasks is caused indirectly by the cookie'd tasks in the runqueue, so the capacity loss is measured more accurately by the sum of both.

Assuming cpu x and cpu y are a pair of SMT siblings, consider the following scenarios:
  1. There's a cookie'd task running on cpu x, and there are 4 uncookie'd tasks running on cpu y. For cpu x, there will be 80% forced idle time (from the uncookie'd tasks); for cpu y, there will be 20% forced idle time (from the cookie'd task).
  2. There's an uncookie'd task running on cpu x, and there are 4 cookie'd tasks running on cpu y. For cpu x, there will be 80% forced idle time (from the cookie'd tasks); for cpu y, there will be 20% forced idle time (from the uncookie'd task).

Scenario 1 can be reproduced with stress-ng (scenario 2 can be reproduced similarly):
    (cookie'd)taskset -c x stress-ng -c 1 -l 100
    (uncookie'd)taskset -c y stress-ng -c 4 -l 100

In both scenarios the total capacity loss is 1 cpu, but in scenario 1 the cookie'd forced idle time reports a 20% cpu capacity loss, and in scenario 2 it reports an 80% cpu capacity loss; neither is accurate. It is more accurate to measure with both the cookie'd forced idle time and the uncookie'd forced idle time.
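
The arithmetic behind the two scenarios can be sketched as below. This is only an illustrative model, not kernel code: it assumes the two siblings never run tasks concurrently (their cookies differ) and that every runnable task on the core gets an equal share of time. The name `forceidle_share_pct` is hypothetical.

```c
#include <assert.h>

/*
 * Illustrative model only (not kernel code).
 *
 * Assumptions: cpu x and cpu y are SMT siblings whose tasks can never
 * run at the same time, and each runnable task receives an equal
 * share of core time.
 *
 * Returns the percentage of time a sibling with own_tasks runnable
 * tasks is forced idle while the other sibling has peer_tasks.
 */
unsigned int forceidle_share_pct(unsigned int own_tasks,
				 unsigned int peer_tasks)
{
	assert(own_tasks + peer_tasks > 0);

	/* This sibling idles whenever one of the peer's tasks runs. */
	return 100u * peer_tasks / (own_tasks + peer_tasks);
}
```

With 1 task on cpu x and 4 on cpu y, the model gives 80% for cpu x and 20% for cpu y, matching the figures above; the two shares always add up to one full cpu regardless of which side holds the cookie.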

Signed-off-by: Cruz Zhao
Reviewed-by: Josh Don
---
 kernel/sched/core.c       | 3 +--
 kernel/sched/core_sched.c | 2 +-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e4ae00..e8187e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5822,8 +5822,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
 	}
 
 	if (schedstat_enabled() && rq->core->core_forceidle_count) {
-		if (cookie)
-			rq->core->core_forceidle_start = rq_clock(rq->core);
+		rq->core->core_forceidle_start = rq_clock(rq->core);
 		rq->core->core_forceidle_occupation = occ;
 	}
 
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index 1fb4567..c8746a9 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -277,7 +277,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 		rq_i = cpu_rq(i);
 		p = rq_i->core_pick ?: rq_i->curr;
 
-		if (!p->core_cookie)
+		if (p == rq_i->idle)
 			continue;
 
 		__schedstat_add(p->stats.core_forceidle_sum, delta);
-- 
1.8.3.1

From nobody Tue Jun 30 19:06:00 2026
From: Cruz Zhao
To: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, joshdon@google.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 2/3] sched/core: Forced idle accounting per-cpu
Date: Tue, 11 Jan 2022 17:56:00 +0800
Message-Id: <1641894961-9241-3-git-send-email-CruzZhao@linux.alibaba.com>
In-Reply-To: <1641894961-9241-1-git-send-email-CruzZhao@linux.alibaba.com>
References: <1641894961-9241-1-git-send-email-CruzZhao@linux.alibaba.com>

Account for "forced idle" time per cpu, which is the time a cpu is forced to idle by its SMT sibling.

Since it is not accurate to measure the capacity loss by cookie'd forced idle time alone, and it is hard to trace the forced idle time caused by all the uncookie'd tasks, we account the forced idle time from the perspective of the cpu.

Forced idle time per cpu is displayed via /proc/schedstat: the forced idle time of cpu x is the 10th column of the row for cpu x. The unit is ns. It also requires that schedstats is enabled.
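
Reading that column from userspace could look roughly like the sketch below. It assumes the row layout proposed by this patch (the "cpuN" label followed by ten numeric columns, the tenth being the forced idle time in ns); the helper name `schedstat_forceidle_ns` is hypothetical and the row layout is not guaranteed on kernels without this change.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the forced idle time (ns) from one "cpuN ..." row of
 * /proc/schedstat.
 *
 * Assumption: the row has the layout proposed by this patch, i.e. the
 * "cpuN" label followed by ten numeric columns, with the forced idle
 * time in the tenth.
 *
 * Returns 0 on success and stores the value in *ns, -1 otherwise.
 */
int schedstat_forceidle_ns(const char *row, unsigned long long *ns)
{
	const char *p = row;
	char *end;
	unsigned long long val = 0;
	int col;

	assert(row && ns);

	/* Only per-cpu rows carry the column; skip version/domain rows. */
	if (strncmp(p, "cpu", 3) != 0)
		return -1;

	/* Skip the "cpuN" label itself. */
	p += strcspn(p, " \t\n");

	for (col = 1; col <= 10; col++) {
		errno = 0;
		val = strtoull(p, &end, 10);
		if (end == p || errno == ERANGE)
			return -1;	/* fewer than ten columns, or overflow */
		p = end;
	}

	*ns = val;
	return 0;
}
```

A caller would read /proc/schedstat line by line and pass each line to the helper; rows from a kernel without the extra column are rejected rather than misread.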

Signed-off-by: Cruz Zhao
---
 kernel/sched/core.c       |  7 ++++++-
 kernel/sched/core_sched.c |  7 +++++--
 kernel/sched/sched.h      |  4 ++++
 kernel/sched/stats.c      | 17 +++++++++++++++--
 4 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8187e7..a224b916 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -285,8 +285,10 @@ static void __sched_core_flip(bool enabled)
 
 		sched_core_lock(cpu, &flags);
 
-		for_each_cpu(t, smt_mask)
+		for_each_cpu(t, smt_mask) {
 			cpu_rq(t)->core_enabled = enabled;
+			cpu_rq(t)->in_forcedidle = false;
+		}
 
 		cpu_rq(cpu)->core->core_forceidle_start = 0;
 
@@ -5690,6 +5692,7 @@ static inline struct task_struct *pick_task(struct rq *rq)
 		 * another cpu during offline.
 		 */
 		rq->core_pick = NULL;
+		rq->in_forcedidle = false;
 		return __pick_next_task(rq, prev, rf);
 	}
 
@@ -5810,9 +5813,11 @@ static inline struct task_struct *pick_task(struct rq *rq)
 
 		rq_i->core_pick = p;
 
+		rq->in_forcedidle = false;
 		if (p == rq_i->idle) {
 			if (rq_i->nr_running) {
 				rq->core->core_forceidle_count++;
+				rq_i->in_forcedidle = true;
 				if (!fi_before)
 					rq->core->core_forceidle_seq++;
 			}
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index c8746a9..fe04805 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -242,7 +242,7 @@ int sched_core_share_pid(unsigned int cmd, pid_t pid, enum pid_type type,
 void __sched_core_account_forceidle(struct rq *rq)
 {
 	const struct cpumask *smt_mask = cpu_smt_mask(cpu_of(rq));
-	u64 delta, now = rq_clock(rq->core);
+	u64 delta_per_idlecpu, delta, now = rq_clock(rq->core);
 	struct rq *rq_i;
 	struct task_struct *p;
 	int i;
@@ -254,7 +254,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 	if (rq->core->core_forceidle_start == 0)
 		return;
 
-	delta = now - rq->core->core_forceidle_start;
+	delta_per_idlecpu = delta = now - rq->core->core_forceidle_start;
 	if (unlikely((s64)delta <= 0))
 		return;
 
@@ -277,6 +277,9 @@ void __sched_core_account_forceidle(struct rq *rq)
 		rq_i = cpu_rq(i);
 		p = rq_i->core_pick ?: rq_i->curr;
 
+		if (rq_i->in_forcedidle)
+			rq->rq_forceidle_time += delta_per_idlecpu;
+
 		if (p == rq_i->idle)
 			continue;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be9..9377d91 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1086,6 +1086,9 @@ struct rq {
 	/* try_to_wake_up() stats */
 	unsigned int		ttwu_count;
 	unsigned int		ttwu_local;
+#ifdef CONFIG_SCHED_CORE
+	u64			rq_forceidle_time;
+#endif
 #endif
 
 #ifdef CONFIG_CPU_IDLE
@@ -1115,6 +1118,7 @@ struct rq {
 	unsigned int		core_forceidle_seq;
 	unsigned int		core_forceidle_occupation;
 	u64			core_forceidle_start;
+	bool			in_forcedidle;
 #endif
 };
 
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde29..ea22a8c 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -108,6 +108,16 @@ void __update_stats_enqueue_sleeper(struct rq *rq, struct task_struct *p,
 	}
 }
 
+#ifdef CONFIG_SCHED_CORE
+static inline u64 get_rq_forceidle_time(struct rq *rq) {
+	return rq->rq_forceidle_time;
+}
+#else
+static inline u64 get_rq_forceidle_time(struct rq *rq) {
+	return 0;
+}
+#endif
+
 /*
  * Current schedstat API version.
  *
@@ -125,21 +135,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
 		seq_printf(seq, "timestamp %lu\n", jiffies);
 	} else {
 		struct rq *rq;
+		u64 rq_forceidle_time;
 #ifdef CONFIG_SMP
 		struct sched_domain *sd;
 		int dcount = 0;
 #endif
 		cpu = (unsigned long)(v - 2);
 		rq = cpu_rq(cpu);
+		rq_forceidle_time = get_rq_forceidle_time(rq);
 
 		/* runqueue-specific stats */
 		seq_printf(seq,
-		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+		    "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu",
 		    cpu, rq->yld_count,
 		    rq->sched_count, rq->sched_goidle,
 		    rq->ttwu_count, rq->ttwu_local,
 		    rq->rq_cpu_time,
-		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+		    rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+		    rq_forceidle_time);
 
 		seq_printf(seq, "\n");
 
-- 
1.8.3.1

From nobody Tue Jun 30 19:06:00 2026
From: Cruz Zhao
To: tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, joshdon@google.com
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2 3/3] sched/core: Force idle accounting per cgroup
Date: Tue, 11 Jan 2022 17:56:01 +0800
Message-Id: <1641894961-9241-4-git-send-email-CruzZhao@linux.alibaba.com>
In-Reply-To: <1641894961-9241-1-git-send-email-CruzZhao@linux.alibaba.com>
References: <1641894961-9241-1-git-send-email-CruzZhao@linux.alibaba.com>

Account for "force idle" time per cgroup, which is the time the tasks of the cgroup forced their SMT siblings into idle.

Force idle time per cgroup is displayed via /sys/fs/cgroup/cpuacct/$cg/cpuacct.forceidle. Force idle time per cgroup per cpu is displayed via /sys/fs/cgroup/cpuacct/$cg/cpuacct.forceidle_percpu. The unit is ns. It also requires that schedstats is enabled.

We can get the total system forced idle time by looking at the root cgroup, and we can see how long each cgroup forced its SMT siblings into idle. If the force idle time of a cgroup is high, that can be rectified by making some changes (e.g. affinity, cpu budget, etc.) to the cgroup.
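
The relationship between the two files can be sketched as follows: summing the values of one cpuacct.forceidle_percpu line should give the figure reported by cpuacct.forceidle at the same instant. This is a userspace sketch under the assumption that the per-cpu file has the format proposed in this patch (space-separated unsigned values on one line); the helper name `forceidle_percpu_sum` is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Sum the per-cpu values found on one line of cpuacct.forceidle_percpu.
 *
 * Assumption: the line has the format proposed by this patch, i.e.
 * unsigned decimal values separated by spaces and ended by a newline.
 *
 * Returns the number of cpus parsed and stores the sum (ns) in
 * *total_ns, or returns -1 on malformed input.
 */
int forceidle_percpu_sum(const char *line, unsigned long long *total_ns)
{
	const char *p = line;
	char *end;
	unsigned long long sum = 0;
	int ncpus = 0;

	assert(line && total_ns);

	for (;;) {
		unsigned long long val;

		errno = 0;
		val = strtoull(p, &end, 10);
		if (end == p)
			break;		/* no more numbers on this line */
		if (errno == ERANGE)
			return -1;
		sum += val;
		ncpus++;
		p = end;
	}

	/* Anything left besides whitespace means the line is malformed. */
	while (*p == ' ' || *p == '\t' || *p == '\n')
		p++;
	if (*p != '\0')
		return -1;

	*total_ns = sum;
	return ncpus;
}
```

Comparing the per-cpu breakdown of a cgroup against its total is one way to spot which cpus contribute most of the force idle time before adjusting affinity.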

Signed-off-by: Cruz Zhao
---
 include/linux/cgroup.h    |  7 +++++
 kernel/sched/core_sched.c |  1 +
 kernel/sched/cpuacct.c    | 79 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 87 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75c1514..0c1b616 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -774,10 +774,17 @@ static inline struct cgroup *cgroup_get_from_id(u64 id)
 #ifdef CONFIG_CGROUP_CPUACCT
 void cpuacct_charge(struct task_struct *tsk, u64 cputime);
 void cpuacct_account_field(struct task_struct *tsk, int index, u64 val);
+#ifdef CONFIG_SCHED_CORE
+void cpuacct_account_forceidle(int cpu, struct task_struct *task, u64 cputime);
+#endif
 #else
 static inline void cpuacct_charge(struct task_struct *tsk, u64 cputime) {}
 static inline void cpuacct_account_field(struct task_struct *tsk, int index,
 					 u64 val) {}
+#ifdef CONFIG_SCHED_CORE
+static inline void cpuacct_account_forceidle(int cpu, struct task_struct *task,
+					      u64 cputime) {}
+#endif
 #endif
 
 void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
diff --git a/kernel/sched/core_sched.c b/kernel/sched/core_sched.c
index fe04805..add8672 100644
--- a/kernel/sched/core_sched.c
+++ b/kernel/sched/core_sched.c
@@ -284,6 +284,7 @@ void __sched_core_account_forceidle(struct rq *rq)
 			continue;
 
 		__schedstat_add(p->stats.core_forceidle_sum, delta);
+		cpuacct_account_forceidle(i, p, delta);
 	}
 }
 
diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 3d06c5e..b5c5d99 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -27,6 +27,9 @@ struct cpuacct {
 	/* cpuusage holds pointer to a u64-type object on every CPU */
 	u64 __percpu	*cpuusage;
 	struct kernel_cpustat __percpu	*cpustat;
+#ifdef CONFIG_SCHED_CORE
+	u64 __percpu	*forceidle;
+#endif
 };
 
 static inline struct cpuacct *css_ca(struct cgroup_subsys_state *css)
@@ -46,9 +49,15 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 }
 
 static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
+#ifdef CONFIG_SCHED_CORE
+static DEFINE_PER_CPU(u64, root_cpuacct_forceidle);
+#endif
 static struct cpuacct root_cpuacct = {
 	.cpustat	= &kernel_cpustat,
 	.cpuusage	= &root_cpuacct_cpuusage,
+#ifdef CONFIG_SCHED_CORE
+	.forceidle	= &root_cpuacct_forceidle,
+#endif
 };
 
 /* Create a new CPU accounting group */
@@ -72,8 +81,18 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
 	if (!ca->cpustat)
 		goto out_free_cpuusage;
 
+#ifdef CONFIG_SCHED_CORE
+	ca->forceidle = alloc_percpu(u64);
+	if (!ca->forceidle)
+		goto out_free_cpustat;
+#endif
+
 	return &ca->css;
 
+#ifdef CONFIG_SCHED_CORE
+out_free_cpustat:
+	free_percpu(ca->cpustat);
+#endif
 out_free_cpuusage:
 	free_percpu(ca->cpuusage);
 out_free_ca:
@@ -290,6 +309,37 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 	return 0;
 }
 
+#ifdef CONFIG_SCHED_CORE
+static u64 __forceidle_read(struct cpuacct *ca, int cpu)
+{
+	return *per_cpu_ptr(ca->forceidle, cpu);
+}
+static int cpuacct_percpu_forceidle_seq_show(struct seq_file *m, void *V)
+{
+	struct cpuacct *ca = css_ca(seq_css(m));
+	u64 percpu;
+	int i;
+
+	for_each_possible_cpu(i) {
+		percpu = __forceidle_read(ca, i);
+		seq_printf(m, "%llu ", (unsigned long long) percpu);
+	}
+	seq_printf(m, "\n");
+	return 0;
+}
+static u64 cpuacct_forceidle_read(struct cgroup_subsys_state *css,
+		struct cftype *cft)
+{
+	struct cpuacct *ca = css_ca(css);
+	u64 totalforceidle = 0;
+	int i;
+
+	for_each_possible_cpu(i)
+		totalforceidle += __forceidle_read(ca, i);
+	return totalforceidle;
+}
+#endif
+
 static struct cftype files[] = {
 	{
 		.name = "usage",
@@ -324,6 +374,16 @@ static int cpuacct_stats_show(struct seq_file *sf, void *v)
 		.name = "stat",
 		.seq_show = cpuacct_stats_show,
 	},
+#ifdef CONFIG_SCHED_CORE
+	{
+		.name = "forceidle",
+		.read_u64 = cpuacct_forceidle_read,
+	},
+	{
+		.name = "forceidle_percpu",
+		.seq_show = cpuacct_percpu_forceidle_seq_show,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -359,6 +419,25 @@ void cpuacct_account_field(struct task_struct *tsk, int index, u64 val)
 	rcu_read_unlock();
 }
 
+#ifdef CONFIG_SCHED_CORE
+void cpuacct_account_forceidle(int cpu, struct task_struct *tsk, u64 cputime)
+{
+	struct cpuacct *ca;
+	u64 *fi;
+
+	rcu_read_lock();
+	/*
+	 * We have hold rq->core->__lock here, which protects ca->forceidle
+	 * percpu.
+	 */
+	for (ca = task_ca(tsk); ca; ca = parent_ca(ca)) {
+		fi = per_cpu_ptr(ca->forceidle, cpu);
+		*fi += cputime;
+	}
+	rcu_read_unlock();
+}
+#endif
+
 struct cgroup_subsys cpuacct_cgrp_subsys = {
 	.css_alloc	= cpuacct_css_alloc,
 	.css_free	= cpuacct_css_free,
-- 
1.8.3.1