From: Abel Wu
To: Johannes Weiner, Tejun Heo, Michal Koutný, Ingo Molnar, Peter Zijlstra,
    Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
    Mel Gorman, Valentin Schneider, Bitao Hu, Thomas Gleixner, Yury Norov,
    Abel Wu, Andrew Morton, Chen Ridong
Cc: cgroups@vger.kernel.org (open list:CONTROL GROUP (CGROUP)),
    linux-kernel@vger.kernel.org (open list)
Subject: [PATCH v4 2/2] cgroup/rstat: Add run_delay accounting for cgroups
Date: Sun, 9 Feb 2025 14:13:12 +0800
Message-Id: <20250209061322.15260-3-wuyun.abel@bytedance.com>
In-Reply-To: <20250209061322.15260-1-wuyun.abel@bytedance.com>
References: <20250209061322.15260-1-wuyun.abel@bytedance.com>

The "some" field of the cpu.pressure indicator may lose insight into how
severely a cgroup is stalled on a given CPU, because PSI tracks stall time
for each CPU through:

	tSOME[cpu] = time(nr_delayed_tasks[cpu] != 0)

which reduces nr_delayed_tasks[cpu] to a boolean value. Together with
this cgroup-level run_delay accounting, the scheduling info of cgroups
will be better illustrated.

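To make the difference concrete, here is a minimal user-space sketch, not
kernel code. It assumes a simplified model in which one CPU is sampled at a
fixed interval; the names some_time_ns(), run_delay_ns(), nr_delayed and
tick_ns are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model, not kernel code: nr_delayed[i] is the number of
 * runnable-but-waiting tasks of one cgroup on one CPU during sample i,
 * and every sample covers tick_ns nanoseconds.
 */

/* PSI-style "some" time: a sample counts once if anything was delayed. */
uint64_t some_time_ns(const unsigned int *nr_delayed, size_t n,
		      uint64_t tick_ns)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		if (nr_delayed[i] != 0)
			total += tick_ns;
	return total;
}

/* run_delay-style time: every delayed task contributes its own wait. */
uint64_t run_delay_ns(const unsigned int *nr_delayed, size_t n,
		      uint64_t tick_ns)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += (uint64_t)nr_delayed[i] * tick_ns;
	return total;
}
```

In this model, a CPU with one waiting task and a CPU with four waiting tasks
over the same samples report the same "some" time, while the summed run
delay differs by a factor of four.
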
Currently, task-level and CPU-level accounting are already tracked through
the following two holders, respectively:

	struct task_struct::sched_info	if SCHED_INFO
	struct rq::rq_sched_info	if SCHEDSTATS

When extending this to cgroups, the minimal requirement would be:

	root:		relies on rq::rq_sched_info, hence SCHEDSTATS
	non-root:	relies on task's, hence SCHED_INFO

It might be too demanding to require both, while collecting data for the
root cgroup from different holders according to different configs would
also be confusing and error-prone. In order to keep things simple, let us
rely on the cputime infrastructure to do the accounting as the other
cputimes do.

Only cgroup v2 is supported and CONFIG_SCHED_INFO is required.

Signed-off-by: Abel Wu
---
 include/linux/cgroup-defs.h |  3 +++
 include/linux/kernel_stat.h |  7 +++++++
 kernel/cgroup/rstat.c       | 25 +++++++++++++++++++++++++
 kernel/sched/cputime.c      | 10 ++++++++++
 kernel/sched/stats.h        |  3 +++
 5 files changed, 48 insertions(+)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1b20d2d8ef7c..287366e60414 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -328,6 +328,9 @@ struct cgroup_base_stat {
 	u64 forceidle_sum;
 #endif
 	u64 ntime;
+#ifdef CONFIG_SCHED_INFO
+	u64 run_delay;
+#endif
 };
 
 /*
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index b97ce2df376f..ddd59fea10ad 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -29,6 +29,9 @@ enum cpu_usage_stat {
 	CPUTIME_GUEST_NICE,
 #ifdef CONFIG_SCHED_CORE
 	CPUTIME_FORCEIDLE,
+#endif
+#ifdef CONFIG_SCHED_INFO
+	CPUTIME_RUN_DELAY,
 #endif
 	NR_STATS,
 };
@@ -141,4 +144,8 @@ extern void account_idle_ticks(unsigned long ticks);
 extern void __account_forceidle_time(struct task_struct *tsk, u64 delta);
 #endif
 
+#ifdef CONFIG_SCHED_INFO
+extern void account_run_delay_time(struct task_struct *tsk, u64 delta);
+#endif
+
 #endif /* _LINUX_KERNEL_STAT_H */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index c2784c317cdd..53984cdf7f9b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -445,6 +445,9 @@ static void cgroup_base_stat_add(struct cgroup_base_stat *dst_bstat,
 	dst_bstat->forceidle_sum += src_bstat->forceidle_sum;
 #endif
 	dst_bstat->ntime += src_bstat->ntime;
+#ifdef CONFIG_SCHED_INFO
+	dst_bstat->run_delay += src_bstat->run_delay;
+#endif
 }
 
 static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
@@ -457,6 +460,9 @@ static void cgroup_base_stat_sub(struct cgroup_base_stat *dst_bstat,
 	dst_bstat->forceidle_sum -= src_bstat->forceidle_sum;
 #endif
 	dst_bstat->ntime -= src_bstat->ntime;
+#ifdef CONFIG_SCHED_INFO
+	dst_bstat->run_delay -= src_bstat->run_delay;
+#endif
 }
 
 static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
@@ -551,6 +557,11 @@ void __cgroup_account_cputime_field(struct cgroup *cgrp,
 	case CPUTIME_FORCEIDLE:
 		rstatc->bstat.forceidle_sum += delta_exec;
 		break;
+#endif
+#ifdef CONFIG_SCHED_INFO
+	case CPUTIME_RUN_DELAY:
+		rstatc->bstat.run_delay += delta_exec;
+		break;
 #endif
 	default:
 		break;
@@ -596,6 +607,9 @@ static void root_cgroup_cputime(struct cgroup_base_stat *bstat)
 		bstat->forceidle_sum += cpustat[CPUTIME_FORCEIDLE];
 #endif
 		bstat->ntime += cpustat[CPUTIME_NICE];
+#ifdef CONFIG_SCHED_INFO
+		bstat->run_delay += cpustat[CPUTIME_RUN_DELAY];
+#endif
 	}
 }
 
@@ -610,6 +624,16 @@ static void cgroup_force_idle_show(struct seq_file *seq, struct cgroup_base_stat
 #endif
 }
 
+static void cgroup_run_delay_show(struct seq_file *seq, struct cgroup_base_stat *bstat)
+{
+#ifdef CONFIG_SCHED_INFO
+	u64 run_delay = bstat->run_delay;
+
+	do_div(run_delay, NSEC_PER_USEC);
+	seq_printf(seq, "run_delay_usec %llu\n", run_delay);
+#endif
+}
+
 void cgroup_base_stat_cputime_show(struct seq_file *seq)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
@@ -640,6 +664,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 		   bstat.ntime);
 
 	cgroup_force_idle_show(seq, &bstat);
+	cgroup_run_delay_show(seq, &bstat);
 }
 
 /* Add bpf kfuncs for cgroup_rstat_updated() and cgroup_rstat_flush() */
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5d9143dd0879..42af602c10a6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -243,6 +243,16 @@ void __account_forceidle_time(struct task_struct *p, u64 delta)
 }
 #endif
 
+#ifdef CONFIG_SCHED_INFO
+/*
+ * Account for run_delay time spent waiting in rq.
+ */
+void account_run_delay_time(struct task_struct *p, u64 delta)
+{
+	task_group_account_field(p, CPUTIME_RUN_DELAY, delta);
+}
+#endif
+
 /*
  * When a guest is interrupted for a longer amount of time, missed clock
  * ticks are not redelivered later. Due to that, this function may on
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 19cdbe96f93d..fdfd04a89b05 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -252,7 +252,9 @@ static inline void sched_info_dequeue(struct rq *rq, struct task_struct *t)
 		t->sched_info.max_run_delay = delta;
 	if (delta && (!t->sched_info.min_run_delay || delta < t->sched_info.min_run_delay))
 		t->sched_info.min_run_delay = delta;
+
 	rq_sched_info_dequeue(rq, delta);
+	account_run_delay_time(t, delta);
 }
 
 /*
@@ -279,6 +281,7 @@ static void sched_info_arrive(struct rq *rq, struct task_struct *t)
 		t->sched_info.min_run_delay = delta;
 
 	rq_sched_info_arrive(rq, delta);
+	account_run_delay_time(t, delta);
 }
 
 /*
-- 
2.37.3
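
For reference, a user-space consumer of the new field might look like the
sketch below. It assumes the value reaches user space as a
"run_delay_usec <value>" line, as printed by cgroup_run_delay_show() above,
presumably through the cgroup v2 cpu.stat file. The helpers stat_lookup()
and ns_to_usec() are hypothetical names for this illustration and are not
part of the patch.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan "key value\n" lines and return 0 with *val set when key is found,
 * -1 otherwise. The key must match a whole field name at line start.
 */
int stat_lookup(const char *text, const char *key, uint64_t *val)
{
	size_t klen = strlen(key);

	while (*text) {
		const char *eol = strchr(text, '\n');

		if (!strncmp(text, key, klen) && text[klen] == ' ' &&
		    sscanf(text + klen + 1, "%" SCNu64, val) == 1)
			return 0;
		if (!eol)
			break;
		text = eol + 1;
	}
	return -1;
}

/* Same truncating conversion the patch does with do_div(): ns to usec. */
uint64_t ns_to_usec(uint64_t ns)
{
	return ns / 1000;
}
```

A caller would read the whole stat file into a buffer and pass it as text
with "run_delay_usec" as the key. A return value of -1 then covers kernels
built without CONFIG_SCHED_INFO, where the line is not printed at all.
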