From nobody Wed Apr 8 02:49:45 2026
From: Chengming Zhou
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, surenb@google.com
Cc: gregkh@linuxfoundation.org, corbet@lwn.net, mingo@redhat.com, peterz@infradead.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Chengming Zhou
Subject: [PATCH v3 01/10] sched/psi: fix periodic aggregation shut off
Date: Wed, 24 Aug 2022 16:18:20 +0800
Message-Id: <20220824081829.33748-2-zhouchengming@bytedance.com>
In-Reply-To: <20220824081829.33748-1-zhouchengming@bytedance.com>

We don't want to wake periodic aggregation work back up if the task
change is the aggregation worker itself going to sleep, or we'll
ping-pong forever.
Previously, we would use psi_task_change() in psi_dequeue() when a task
was going to sleep, so this check was put in psi_task_change(). But
commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the task sleep handling to psi_task_switch(), which no longer
goes through psi_task_change(). So this patch moves the check to
psi_task_switch().

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/sched/psi.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ecb4b4ff4ce0..39463dcc16bb 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,7 +796,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	bool wake_clock = true;
 	void *iter = NULL;
 	u64 now;
 
@@ -806,19 +805,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	psi_flags_change(task, clear, set);
 
 	now = cpu_clock(cpu);
-	/*
-	 * Periodic aggregation shuts off if there is a period of no
-	 * task changes, so we wake it back up if necessary. However,
-	 * don't do this if the task change is the aggregation worker
-	 * itself going to sleep, or we'll ping-pong forever.
-	 */
-	if (unlikely((clear & TSK_RUNNING) &&
-		     (task->flags & PF_WQ_WORKER) &&
-		     wq_worker_last_func(task) == psi_avgs_work))
-		wake_clock = false;
 
 	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, wake_clock);
+		psi_group_change(group, cpu, clear, set, now, true);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -854,6 +843,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
+		bool wake_clock = true;
 
 		/*
 		 * When we're going to sleep, psi_dequeue() lets us
@@ -867,13 +857,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 			clear |= TSK_MEMSTALL_RUNNING;
 		if (prev->in_iowait)
 			set |= TSK_IOWAIT;
+
+		/*
+		 * Periodic aggregation shuts off if there is a period of no
+		 * task changes, so we wake it back up if necessary. However,
+		 * don't do this if the task change is the aggregation worker
+		 * itself going to sleep, or we'll ping-pong forever.
+		 */
+		if (unlikely((prev->flags & PF_WQ_WORKER) &&
+			     wq_worker_last_func(prev) == psi_avgs_work))
+			wake_clock = false;
 	}
 
 	psi_flags_change(prev, clear, set);
 
 	iter = NULL;
 	while ((group = iterate_groups(prev, &iter)) && group != common)
-		psi_group_change(group, cpu, clear, set, now, true);
+		psi_group_change(group, cpu, clear, set, now, wake_clock);
 
 	/*
 	 * TSK_ONCPU is handled up to the common ancestor.
	 * If we're tasked
@@ -882,7 +882,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	if (sleep) {
 		clear &= ~TSK_ONCPU;
 		for (; group; group = iterate_groups(prev, &iter))
-			psi_group_change(group, cpu, clear, set, now, true);
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
 	}
 }
-- 
2.37.2

From: Chengming Zhou
Subject: [PATCH v3 02/10] sched/psi: don't create cgroup PSI files when psi_disabled
Date: Wed, 24 Aug 2022 16:18:21 +0800
Message-Id: <20220824081829.33748-3-zhouchengming@bytedance.com>
In-Reply-To: <20220824081829.33748-1-zhouchengming@bytedance.com>
Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking
configurable") made it possible to configure PSI to skip per-cgroup
stall accounting, and stopped exposing the PSI files in the cgroup
hierarchy in that case.

This patch does the same thing when psi_disabled.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/cgroup/cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 03dbbf8a8c28..2f79ddf9a85d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3748,6 +3748,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
 
 bool cgroup_psi_enabled(void)
 {
+	if (static_branch_likely(&psi_disabled))
+		return false;
+
 	return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
 }
 
-- 
2.37.2

From: Chengming Zhou
Subject: [PATCH v3 03/10] sched/psi: save percpu memory when !psi_cgroups_enabled
Date: Wed, 24 Aug 2022 16:18:22 +0800
Message-Id: <20220824081829.33748-4-zhouchengming@bytedance.com>

We won't use the cgroup psi_group when !psi_cgroups_enabled, so don't
bother allocating and initializing percpu memory for it. We also don't
need to migrate task PSI stats between cgroups in cgroup_move_task().

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 39463dcc16bb..77d53c03a76f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -201,6 +201,7 @@ void __init psi_init(void)
 {
 	if (!psi_enable) {
 		static_branch_enable(&psi_disabled);
+		static_branch_disable(&psi_cgroups_enabled);
 		return;
 	}
 
@@ -950,7 +951,7 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return 0;
 
 	cgroup->psi = kzalloc(sizeof(struct psi_group), GFP_KERNEL);
@@ -968,7 +969,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 
 void psi_cgroup_free(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return;
 
 	cancel_delayed_work_sync(&cgroup->psi->avgs_work);
@@ -996,7 +997,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (static_branch_likely(&psi_disabled)) {
+	if (!static_branch_likely(&psi_cgroups_enabled)) {
 		/*
 		 * Lame to do this here, but the scheduler cannot be locked
 		 * from the outside, so we move cgroups from inside sched/.
-- 
2.37.2
From: Chengming Zhou
Subject: [PATCH v3 04/10] sched/psi: move private helpers to sched/stats.h
Date: Wed, 24 Aug 2022 16:18:23 +0800
Message-Id: <20220824081829.33748-5-zhouchengming@bytedance.com>

This patch moves the psi_task_change()/psi_task_switch() declarations
out of the public PSI header, since they are only needed to implement
the PSI stats tracking in sched/stats.h.

psi_task_switch() is obviously private; psi_task_change() can't be a
public helper either, since it doesn't check the psi_disabled static
key. And it has no other users now, so put it in sched/stats.h too.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 include/linux/psi.h  | 4 ----
 kernel/sched/stats.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index dd74411ac21d..fffd229fbf19 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
 
 void psi_init(void);
 
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
-		     bool sleep);
-
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c1ba96..c39b467ece43 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+		     bool sleep);
+
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
  * memory stalls.
  * As a result, it has to distinguish between sleeps,
-- 
2.37.2

From: Chengming Zhou
Subject: [PATCH v3 05/10] sched/psi: optimize task switch inside shared cgroups again
Date: Wed, 24 Aug 2022 16:18:24 +0800
Message-Id: <20220824081829.33748-6-zhouchengming@bytedance.com>

Commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred the prev task's sleep handling to psi_task_switch(), so that we
don't need to clear and set TSK_ONCPU
state for common cgroups:

             A
             |
             B
            / \
           C   D
          /     \
       prev     next

After that commit, psi_task_switch() does:

  1. psi_group_change(next, .set=TSK_ONCPU) for D
  2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
  3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A

But there is a limitation: "prev->psi_flags == next->psi_flags" must
hold, and when it doesn't, the cgroup optimization is unusable for both
the sleep switch case and the running switch case. For example, when
prev->in_memstall != next->in_memstall on a sleep switch:

  1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
  2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A

and when prev->in_memstall != next->in_memstall on a running switch:

  1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
  2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A

The reason this limitation exists is that we consider a group to be in
PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive
could run even if it were runnable. So when the CPU's curr changes from
prev to next and their in_memstall status differs, we have to update the
PSI_MEM_FULL status of their common cgroups.

This patch removes the limitation by making psi_group_change() derive
the PSI_MEM_FULL status from the CPU curr's in_memstall status.

Signed-off-by: Chengming Zhou
---
 kernel/sched/psi.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 77d53c03a76f..26c03bd56b9c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -820,8 +820,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * When switching between tasks that have an identical
@@ -829,11 +827,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -880,7 +876,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 	 * with dequeuing too, finish that for the rest of the hierarchy.
 	 */
-	if (sleep) {
+	if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 		clear &= ~TSK_ONCPU;
 		for (; group; group = iterate_groups(prev, &iter))
 			psi_group_change(group, cpu, clear, set, now, wake_clock);
-- 
2.37.2
From: Chengming Zhou
Subject: [PATCH v3 06/10] sched/psi: remove NR_ONCPU task accounting
Date: Wed, 24 Aug 2022 16:18:25 +0800
Message-Id: <20220824081829.33748-7-zhouchengming@bytedance.com>

From: Johannes Weiner

We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance. Since we want to add another
PSI_IRQ_FULL state to track IRQ/SOFTIRQ pressure, we need to reclaim
space first.

This patch removes the NR_ONCPU task accounting in struct
psi_group_cpu and uses one bit in state_mask to track it instead.

Signed-off-by: Johannes Weiner
Signed-off-by: Chengming Zhou
Reviewed-by: Chengming Zhou
Tested-by: Chengming Zhou
---
 include/linux/psi_types.h | 16 +++++++--------
 kernel/sched/psi.c        | 41 ++++++++++++++++++++++++++++-----------
 2 files changed, 37 insertions(+), 20 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..54cb74946db4 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -15,13 +15,6 @@ enum psi_task_count {
 	NR_IOWAIT,
 	NR_MEMSTALL,
 	NR_RUNNING,
-	/*
-	 * This can't have values other than 0 or 1 and could be
-	 * implemented as a bit flag. But for now we still have room
-	 * in the first cacheline of psi_group_cpu, and this way we
-	 * don't have to special case any state tracking for it.
-	 */
-	NR_ONCPU,
 	/*
 	 * For IO and CPU stalls the presence of running/oncpu tasks
 	 * in the domain means a partial rather than a full stall.
@@ -32,16 +25,18 @@ enum psi_task_count {
 	 * threads and memstall ones.
 	 */
 	NR_MEMSTALL_RUNNING,
-	NR_PSI_TASK_COUNTS = 5,
+	NR_PSI_TASK_COUNTS = 4,
 };
 
 /* Task state bitmasks */
 #define TSK_IOWAIT	(1 << NR_IOWAIT)
 #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
 #define TSK_RUNNING	(1 << NR_RUNNING)
-#define TSK_ONCPU	(1 << NR_ONCPU)
 #define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)
 
+/* Only one task can be scheduled, no corresponding task count */
+#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)
+
 /* Resources that workloads could be stalled on */
 enum psi_res {
 	PSI_IO,
@@ -68,6 +63,9 @@ enum psi_states {
 	NR_PSI_STATES = 7,
 };
 
+/* Use one bit in the state mask to track TSK_ONCPU */
+#define PSI_ONCPU	(1 << NR_PSI_STATES)
+
 enum psi_aggregators {
 	PSI_AVGS = 0,
 	PSI_POLL,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 26c03bd56b9c..af83531162fc 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -212,7 +212,7 @@ void __init psi_init(void)
 	group_init(&psi_system);
 }
 
-static bool test_state(unsigned int *tasks, enum psi_states state)
+static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
 {
 	switch (state) {
 	case PSI_IO_SOME:
@@ -225,9 +225,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 		return unlikely(tasks[NR_MEMSTALL] &&
 			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
 	case PSI_CPU_SOME:
-		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] > oncpu);
 	case PSI_CPU_FULL:
-		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] && !oncpu);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -689,9 +689,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
 			     bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
-	u32 state_mask = 0;
 	unsigned int t, m;
 	enum psi_states s;
+	u32 state_mask;
 
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
@@ -707,17 +707,36 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 	record_times(groupc, now);
 
+	/*
+	 * Start with TSK_ONCPU, which doesn't have a corresponding
+	 * task count - it's just a boolean flag directly encoded in
+	 * the state mask. Clear, set, or carry the current state if
+	 * no changes are requested.
+	 */
+	if (unlikely(clear & TSK_ONCPU)) {
+		state_mask = 0;
+		clear &= ~TSK_ONCPU;
+	} else if (unlikely(set & TSK_ONCPU)) {
+		state_mask = PSI_ONCPU;
+		set &= ~TSK_ONCPU;
+	} else {
+		state_mask = groupc->state_mask & PSI_ONCPU;
+	}
+
+	/*
+	 * The rest of the state mask is calculated based on the task
+	 * counts. Update those first, then construct the mask.
+	 */
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
 			continue;
 		if (groupc->tasks[t]) {
 			groupc->tasks[t]--;
 		} else if (!psi_bug) {
-			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u %u] clear=%x set=%x\n",
+			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
 					cpu, t,
 					groupc->tasks[0], groupc->tasks[1], groupc->tasks[2],
-					groupc->tasks[3], groupc->tasks[4],
-					clear, set);
+					groupc->tasks[3], clear, set);
 			psi_bug = 1;
 		}
 	}
@@ -726,9 +745,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
 
-	/* Calculate state mask representing active states */
 	for (s = 0; s < NR_PSI_STATES; s++) {
-		if (test_state(groupc->tasks, s))
+		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
 			state_mask |= (1 << s);
 	}
 
@@ -740,7 +758,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
+	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;
@@ -829,7 +847,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
+			    PSI_ONCPU) {
 				common = group;
 				break;
 			}
-- 
2.37.2
From: Chengming Zhou
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, surenb@google.com
Cc: gregkh@linuxfoundation.org, corbet@lwn.net, mingo@redhat.com, peterz@infradead.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Chengming Zhou
Subject: [PATCH v3 07/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
Date: Wed, 24 Aug 2022 16:18:26 +0800
Message-Id: <20220824081829.33748-8-zhouchengming@bytedance.com>
PSI already tracks workload pressure stall information for CPU, memory
and IO. Apart from these, IRQ/SOFTIRQ can have an obvious impact on the
productivity of some workloads, such as web services.

With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ delta time
from update_rq_clock_task(), and record that delta in the cgroups of
the CPU's current task as PSI_IRQ_FULL status.

Note that we don't use PSI_IRQ_SOME: IRQ/SOFTIRQ time is always charged
to the current task on the CPU, so nothing productive can run even if
it were runnable, and only PSI_IRQ_FULL applies.

Signed-off-by: Chengming Zhou
---
 Documentation/admin-guide/cgroup-v2.rst |  6 ++
 include/linux/psi_types.h               | 10 +++-
 kernel/cgroup/cgroup.c                  | 27 +++++++++
 kernel/sched/core.c                     |  1 +
 kernel/sched/psi.c                      | 74 ++++++++++++++++++++++++-
 kernel/sched/stats.h                    |  2 +
 6 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index be4a77baf784..971c418bc778 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,12 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
 
+  irq.pressure
+	A read-write nested-keyed file.
+
+	Shows pressure stall information for IRQ/SOFTIRQ. See
+	:ref:`Documentation/accounting/psi.rst ` for details.
+
 Controllers
 ===========
 
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb74946db4..40c28171cd91 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,10 @@ enum psi_res {
 	PSI_IO,
 	PSI_MEM,
 	PSI_CPU,
-	NR_PSI_RESOURCES = 3,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	PSI_IRQ,
+#endif
+	NR_PSI_RESOURCES,
 };
 
 /*
@@ -58,9 +61,12 @@ enum psi_states {
 	PSI_MEM_FULL,
 	PSI_CPU_SOME,
 	PSI_CPU_FULL,
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	PSI_IRQ_FULL,
+#endif
 	/* Only per-CPU, to weigh the CPU in the global average: */
 	PSI_NONIDLE,
-	NR_PSI_STATES = 7,
+	NR_PSI_STATES,
 };
 
 /* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 2f79ddf9a85d..8540878469e6 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3731,6 +3731,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
 	return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_ino(cgrp) == 1 ?
+		&psi_system : cgrp->psi;
+
+	return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+					 char *buf, size_t nbytes,
+					 loff_t off)
+{
+	return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 				     poll_table *pt)
 {
@@ -5150,6 +5167,16 @@ static struct cftype cgroup_base_files[] = {
 		.poll = cgroup_pressure_poll,
 		.release = cgroup_pressure_release,
 	},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	{
+		.name = "irq.pressure",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_irq_pressure_show,
+		.write = cgroup_irq_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
+	},
+#endif
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 61436b8e0337..178f9836ae96 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
+	psi_account_irqtime(rq->curr, irq_delta);
 #endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index af83531162fc..7aab6f13ed12 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -903,6 +903,36 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	}
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+	int cpu = task_cpu(task);
+	void *iter = NULL;
+	struct psi_group *group;
+	struct psi_group_cpu *groupc;
+	u64 now;
+
+	if (!task->pid)
+		return;
+
+	now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter))) {
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+
+		write_seqcount_begin(&groupc->seq);
+
+		record_times(groupc, now);
+		groupc->times[PSI_IRQ_FULL] += delta;
+
+		write_seqcount_end(&groupc->seq);
+
+		if (group->poll_states & (1 << PSI_IRQ_FULL))
+			psi_schedule_poll_work(group, 1);
+	}
+}
+#endif
+
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
@@ -1064,6 +1094,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
+	bool only_full = false;
 	int full;
 	u64 now;
 
@@ -1078,7 +1109,11 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2; full++) {
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	only_full = res == PSI_IRQ;
+#endif
+
+	for (full = 0; full < 2 - only_full; full++) {
 		unsigned long avg[3] = { 0, };
 		u64 total = 0;
 		int w;
@@ -1092,7 +1127,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		}
 
 		seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
-			   full ? "full" : "some",
+			   full || only_full ?
+				"full" : "some",
 			   LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
 			   LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
 			   LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1120,6 +1155,11 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	else
 		return ERR_PTR(-EINVAL);
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+		return ERR_PTR(-EINVAL);
+#endif
+
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
@@ -1404,6 +1444,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
 	.proc_release	= psi_fop_release,
 };
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+	return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+			     size_t nbytes, loff_t *ppos)
+{
+	return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+	.proc_open	= psi_irq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_irq_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
+};
+#endif
+
 static int __init psi_proc_init(void)
 {
 	if (psi_enable) {
@@ -1411,6 +1478,9 @@ static int __init psi_proc_init(void)
 		proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
 		proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
 		proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+		proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
 	}
 	return 0;
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..84a188913cc9 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,6 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct
		     task_struct *next, bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
 
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -205,6 +206,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.37.2
From: Chengming Zhou
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, surenb@google.com
Cc: gregkh@linuxfoundation.org, corbet@lwn.net, mingo@redhat.com, peterz@infradead.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Chengming Zhou
Subject: [PATCH v3 08/10] sched/psi: consolidate cgroup_psi()
Date: Wed, 24 Aug 2022 16:18:27 +0800
Message-Id: <20220824081829.33748-9-zhouchengming@bytedance.com>
References:
<20220824081829.33748-1-zhouchengming@bytedance.com>

cgroup_psi() can't return the psi_group for the root cgroup, so we have
many open-coded instances of
"psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi".

This patch moves the cgroup_psi() definition to include/linux/psi.h,
where it can return psi_system for the root cgroup, so it handles all
cgroups.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 include/linux/cgroup.h |  5 -----
 include/linux/psi.h    |  6 ++++++
 kernel/cgroup/cgroup.c | 10 +++++-----
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7ed1fa7a6fc8..3c48753f2949 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -682,11 +682,6 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 	pr_cont_kernfs_path(cgrp->kn);
 }
 
-static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
-{
-	return cgrp->psi;
-}
-
 bool cgroup_psi_enabled(void);
 
 static inline void cgroup_init_kthreadd(void)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index fffd229fbf19..362a74ca1d3b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 
 struct seq_file;
 struct css_set;
@@ -30,6 +31,11 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 			  poll_table *wait);
 
 #ifdef CONFIG_CGROUPS
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return cgroup_ino(cgrp) == 1 ?
+		&psi_system : cgrp->psi;
+}
+
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8540878469e6..cc228235ce38 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3657,21 +3657,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
-	struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+	struct psi_group *psi = cgroup_psi(cgrp);
 
 	return psi_show(seq, psi, PSI_IO);
 }
 static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
-	struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+	struct psi_group *psi = cgroup_psi(cgrp);
 
 	return psi_show(seq, psi, PSI_MEM);
 }
 static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
-	struct psi_group *psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+	struct psi_group *psi = cgroup_psi(cgrp);
 
 	return psi_show(seq, psi, PSI_CPU);
 }
@@ -3697,7 +3697,7 @@ static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
 		return -EBUSY;
 	}
 
-	psi = cgroup_ino(cgrp) == 1 ? &psi_system : cgrp->psi;
+	psi = cgroup_psi(cgrp);
 	new = psi_trigger_create(psi, buf, res);
 	if (IS_ERR(new)) {
 		cgroup_put(cgrp);
@@ -3735,7 +3735,7 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
 static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
 {
 	struct cgroup *cgrp = seq_css(seq)->cgroup;
-	struct psi_group *psi = cgroup_ino(cgrp) == 1 ?
-		&psi_system : cgrp->psi;
+	struct psi_group *psi = cgroup_psi(cgrp);
 
 	return psi_show(seq, psi, PSI_IRQ);
 }
-- 
2.37.2
From: Chengming Zhou
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, surenb@google.com
Cc: gregkh@linuxfoundation.org, corbet@lwn.net, mingo@redhat.com, peterz@infradead.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Chengming Zhou
Subject: [PATCH v3 09/10] sched/psi: cache parent psi_group to speed up groups iterate
Date: Wed, 24 Aug 2022 16:18:28 +0800
Message-Id: <20220824081829.33748-10-zhouchengming@bytedance.com>

We use iterate_groups() to walk the psi_group at each level when
updating PSI stats, which is a very hot path.
In the current code, iterate_groups() has to use multiple branches and
cgroup_parent() to get the parent psi_group at each level, which is not
very efficient.

This patch caches the parent psi_group in struct psi_group, so we only
need to get the task's own psi_group first, then just follow
group->parent to iterate.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 include/linux/psi_types.h |  2 ++
 kernel/sched/psi.c        | 47 ++++++++++++++++-----------------------
 2 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 40c28171cd91..a0b746258c68 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -151,6 +151,8 @@ struct psi_trigger {
 };
 
 struct psi_group {
+	struct psi_group *parent;
+
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 7aab6f13ed12..814e99b1fed3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -772,30 +772,18 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		schedule_delayed_work(&group->avgs_work, PSI_FREQ);
 }
 
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
 {
-	if (*iter == &psi_system)
-		return NULL;
-
 #ifdef CONFIG_CGROUPS
-	if (static_branch_likely(&psi_cgroups_enabled)) {
-		struct cgroup *cgroup = NULL;
-
-		if (!*iter)
-			cgroup = task->cgroups->dfl_cgrp;
-		else
-			cgroup = cgroup_parent(*iter);
-
-		if (cgroup && cgroup_parent(cgroup)) {
-			*iter = cgroup;
-			return cgroup_psi(cgroup);
-		}
-	}
+	if (static_branch_likely(&psi_cgroups_enabled))
+		return cgroup_psi(task_dfl_cgroup(task));
 #endif
-	*iter = &psi_system;
 	return &psi_system;
 }
 
+#define for_each_psi_group(group) \
+	for (; group; group = group->parent)
+
 static void psi_flags_change(struct task_struct *task, int clear, int set)
 {
 	if (((task->psi_flags & set) ||
@@ -815,7 +803,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	void *iter = NULL;
 	u64 now;
 
 	if (!task->pid)
@@ -825,7 +812,8 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 
 	now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter)))
+	group = task_psi_group(task);
+	for_each_psi_group(group)
 		psi_group_change(group, cpu, clear, set, now, true);
 }
 
@@ -834,7 +822,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *group, *common = NULL;
 	int cpu = task_cpu(prev);
-	void *iter;
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
@@ -845,8 +832,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		iter = NULL;
-		while ((group = iterate_groups(next, &iter))) {
+		group = task_psi_group(next);
+		for_each_psi_group(group) {
 			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
 			    PSI_ONCPU) {
 				common = group;
@@ -887,9 +874,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 	psi_flags_change(prev, clear, set);
 
-	iter = NULL;
-	while ((group = iterate_groups(prev, &iter)) && group != common)
+	group = task_psi_group(prev);
+	for_each_psi_group(group) {
+		if (group == common)
+			break;
 		psi_group_change(group, cpu, clear, set, now, wake_clock);
+	}
 
 	/*
 	 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -897,7 +887,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	 */
 	if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 		clear &= ~TSK_ONCPU;
-		for (; group; group = iterate_groups(prev, &iter))
+		for_each_psi_group(group)
 			psi_group_change(group, cpu, clear, set, now, wake_clock);
 	}
 }
@@ -907,7 +897,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 void psi_account_irqtime(struct task_struct *task, u32 delta)
 {
 	int cpu = task_cpu(task);
-	void *iter = NULL;
 	struct psi_group *group;
 	struct psi_group_cpu *groupc;
 	u64 now;
@@ -917,7 +906,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
 
 	now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter))) {
+	group = task_psi_group(task);
+	for_each_psi_group(group) {
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
 		write_seqcount_begin(&groupc->seq);
@@ -1009,6 +999,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 		return -ENOMEM;
 	}
 	group_init(cgroup->psi);
+	cgroup->psi->parent = cgroup_psi(cgroup_parent(cgroup));
 	return 0;
 }
-- 
2.37.2
From: Chengming Zhou
To: tj@kernel.org, hannes@cmpxchg.org, mkoutny@suse.com, surenb@google.com
Cc: gregkh@linuxfoundation.org, corbet@lwn.net, mingo@redhat.com, peterz@infradead.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v3 10/10] sched/psi: per-cgroup PSI accounting disable/re-enable interface
Date: Wed, 24 Aug 2022 16:18:29 +0800
Message-Id: <20220824081829.33748-11-zhouchengming@bytedance.com>
In-Reply-To: <20220824081829.33748-1-zhouchengming@bytedance.com>
References: <20220824081829.33748-1-zhouchengming@bytedance.com>

PSI accounts stalls for each cgroup separately and aggregates them at
every level of the hierarchy. This may cause non-negligible overhead
for some workloads when the cgroup hierarchy is deep.

Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking
configurable") made it possible to skip per-cgroup stall accounting and
only account system-wide, avoiding this per-level overhead. But for our
use case we also want the leaf cgroups' PSI stats accounted, so that
userspace can make adjustments on those cgroups, not just system-wide.

So this patch introduces a per-cgroup PSI accounting disable/re-enable
interface, "cgroup.pressure": a read-write single value file whose
allowed values are "0" and "1". The default is "1", so per-cgroup PSI
stats are enabled by default.

Implementation details:

It should be relatively straightforward to disable and re-enable state
aggregation, time tracking and averaging on a per-cgroup level, if we
can live with losing history from while it was disabled. I.e. the avgs
will restart from 0, total= will have gaps.
But it's hard and complex to stop/restart groupc->tasks[] updates, which
is not implemented in this patch. So we always update groupc->tasks[]
and the PSI_ONCPU bit in psi_group_change(), even when the cgroup's PSI
stats are disabled.

Suggested-by: Johannes Weiner
Suggested-by: Tejun Heo
Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 Documentation/admin-guide/cgroup-v2.rst | 17 +++++++
 include/linux/cgroup-defs.h             |  3 ++
 include/linux/psi.h                     |  2 +
 include/linux/psi_types.h               |  1 +
 kernel/cgroup/cgroup.c                  | 56 +++++++++++++++++++++++
 kernel/sched/psi.c                      | 59 ++++++++++++++++++++++---
 6 files changed, 131 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 971c418bc778..4cad4e2b31ec 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -976,6 +976,23 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
 
+  cgroup.pressure
+	A read-write single value file whose allowed values are "0" and "1".
+	The default is "1".
+
+	Writing "0" to the file disables the cgroup PSI accounting.
+	Writing "1" to the file re-enables the cgroup PSI accounting.
+
+	This control attribute is not hierarchical, so disabling or enabling
+	PSI accounting in a cgroup does not affect PSI accounting in
+	descendants, and enablement doesn't need to be passed down from root.
+
+	The reason this control attribute exists is that PSI accounts stalls
+	for each cgroup separately and aggregates them at each level of the
+	hierarchy. This may cause non-negligible overhead for some workloads
+	when the hierarchy is deep, in which case this control attribute can
+	be used to disable PSI accounting in the non-leaf cgroups.
+
   irq.pressure
 	A read-write nested-keyed file.
 
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1283993d7ea8..cfdb74a89c5c 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -428,6 +428,9 @@ struct cgroup {
 	struct cgroup_file procs_file;	/* handle for "cgroup.procs" */
 	struct cgroup_file events_file;	/* handle for "cgroup.events" */
 
+	/* handles for "{cpu,memory,io,irq}.pressure" */
+	struct cgroup_file psi_files[NR_PSI_RESOURCES];
+
 	/*
 	 * The bitmask of subsystems enabled on the child cgroups.
 	 * ->subtree_control is the one configured through
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 362a74ca1d3b..b09c0c611fa7 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -39,6 +39,7 @@ static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_enabled_sync(struct psi_group *group);
 #endif
 
 #else /* CONFIG_PSI */
@@ -60,6 +61,7 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
 {
 	rcu_assign_pointer(p->cgroups, to);
 }
+static inline void psi_cgroup_enabled_sync(struct psi_group *group) {}
 #endif
 
 #endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index a0b746258c68..ab1f9b463df9 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -152,6 +152,7 @@ struct psi_trigger {
 
 struct psi_group {
 	struct psi_group *parent;
+	bool enabled;
 
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index cc228235ce38..fa8428125d62 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3748,6 +3748,52 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
 }
 #endif
 
+static int cgroup_psi_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp =
+		seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_psi(cgrp);
+
+	seq_printf(seq, "%d\n", psi->enabled);
+
+	return 0;
+}
+
+static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	ssize_t ret;
+	int enable;
+	struct cgroup *cgrp;
+	struct psi_group *psi;
+
+	ret = kstrtoint(strstrip(buf), 0, &enable);
+	if (ret)
+		return ret;
+
+	if (enable < 0 || enable > 1)
+		return -ERANGE;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	psi = cgroup_psi(cgrp);
+	if (psi->enabled != enable) {
+		int i;
+
+		/* show or hide {cpu,memory,io,irq}.pressure files */
+		for (i = 0; i < NR_PSI_RESOURCES; i++)
+			cgroup_file_show(&cgrp->psi_files[i], enable);
+
+		psi->enabled = enable;
+		psi_cgroup_enabled_sync(psi);
+	}
+
+	cgroup_kn_unlock(of->kn);
+
+	return nbytes;
+}
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 				     poll_table *pt)
 {
@@ -5146,6 +5192,7 @@ static struct cftype cgroup_base_files[] = {
 	{
 		.name = "io.pressure",
 		.flags = CFTYPE_PRESSURE,
+		.file_offset = offsetof(struct cgroup, psi_files[PSI_IO]),
 		.seq_show = cgroup_io_pressure_show,
 		.write = cgroup_io_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5154,6 +5201,7 @@ static struct cftype cgroup_base_files[] = {
 	{
 		.name = "memory.pressure",
 		.flags = CFTYPE_PRESSURE,
+		.file_offset = offsetof(struct cgroup, psi_files[PSI_MEM]),
 		.seq_show = cgroup_memory_pressure_show,
 		.write = cgroup_memory_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5162,6 +5210,7 @@ static struct cftype cgroup_base_files[] = {
 	{
 		.name = "cpu.pressure",
 		.flags = CFTYPE_PRESSURE,
+		.file_offset = offsetof(struct cgroup, psi_files[PSI_CPU]),
 		.seq_show = cgroup_cpu_pressure_show,
 		.write = cgroup_cpu_pressure_write,
 		.poll = cgroup_pressure_poll,
@@ -5171,12 +5220,19 @@ static struct cftype cgroup_base_files[] = {
 	{
 		.name = "irq.pressure",
 		.flags = CFTYPE_PRESSURE,
+		.file_offset =
+			offsetof(struct cgroup, psi_files[PSI_IRQ]),
 		.seq_show = cgroup_irq_pressure_show,
 		.write = cgroup_irq_pressure_write,
 		.poll = cgroup_pressure_poll,
 		.release = cgroup_pressure_release,
 	},
 #endif
+	{
+		.name = "cgroup.pressure",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_psi_show,
+		.write = cgroup_psi_write,
+	},
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 814e99b1fed3..27bd4946d563 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
 {
 	int cpu;
 
+	group->enabled = true;
 	for_each_possible_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
 	group->avg_last_update = sched_clock();
@@ -696,17 +697,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
 	/*
-	 * First we assess the aggregate resource states this CPU's
-	 * tasks have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 *
-	 * Then we update the task counts according to the state
+	 * First we update the task counts according to the state
 	 * change requested through the @clear and @set bits.
+	 *
+	 * Then, if cgroup PSI stats accounting is enabled, we
+	 * assess the aggregate resource states this CPU's tasks
+	 * have been in since the last change, and account any
+	 * SOME and FULL time these may have resulted in.
 	 */
 	write_seqcount_begin(&groupc->seq);
 
-	record_times(groupc, now);
-
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
@@ -745,6 +745,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
 
+	if (!group->enabled) {
+		if (groupc->state_mask & (1 << PSI_NONIDLE))
+			record_times(groupc, now);
+		groupc->state_mask = state_mask;
+		write_seqcount_end(&groupc->seq);
+		return;
+	}
+
 	for (s = 0; s < NR_PSI_STATES; s++) {
 		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
 			state_mask |= (1 << s);
@@ -761,6 +769,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
+	record_times(groupc, now);
 	groupc->state_mask = state_mask;
 
 	write_seqcount_end(&groupc->seq);
@@ -908,6 +917,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
 
 	group = task_psi_group(task);
 	for_each_psi_group(group) {
+		if (!group->enabled)
+			continue;
 		groupc = per_cpu_ptr(group->pcpu, cpu);
 
 		write_seqcount_begin(&groupc->seq);
@@ -1081,6 +1092,40 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 
 	task_rq_unlock(rq, task, &rf);
 }
+
+void psi_cgroup_enabled_sync(struct psi_group *group)
+{
+	int cpu;
+
+	/*
+	 * After psi_group->enabled is cleared, we don't actually stop
+	 * the per-CPU task accounting in each psi_group_cpu; we only
+	 * stop the test_state() loop, record_times() and the averaging
+	 * worker, see psi_group_change() for details.
+	 *
+	 * When disabling cgroup PSI, this function has nothing to sync
+	 * since the cgroup pressure files are hidden and each per-CPU
+	 * psi_group_cpu sees !psi_group->enabled and only does task
+	 * accounting.
+	 * When re-enabling cgroup PSI, this function uses psi_group_change()
+	 * to get the correct state mask from the test_state() loop over
+	 * tasks[], and restarts groupc->state_start from now. We use
+	 * .clear = .set = 0 here since no task state really changed.
+	 */
+	if (!group->enabled)
+		return;
+
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+		struct rq_flags rf;
+		u64 now;
+
+		rq_lock_irq(rq, &rf);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+		rq_unlock_irq(rq, &rf);
+	}
+}
 #endif /* CONFIG_CGROUPS */
 
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
-- 
2.37.2
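To illustrate the interface this patch adds, here is a hedged usage sketch. It assumes cgroup v2 mounted at /sys/fs/cgroup, a hypothetical child cgroup named "mygroup", and a kernel with this patch applied; the guard makes it a no-op elsewhere.

```shell
#!/bin/sh
# Hypothetical cgroup path; requires cgroup v2 and this patch series.
CG=/sys/fs/cgroup/mygroup

if [ -w "$CG/cgroup.pressure" ]; then
    # Disable per-cgroup PSI accounting; the cpu/memory/io/irq
    # .pressure files of this cgroup are hidden while disabled.
    echo 0 > "$CG/cgroup.pressure"
    cat "$CG/cgroup.pressure"

    # Re-enable. Per the commit message, avgs restart from 0 and
    # total= will have gaps covering the disabled period.
    echo 1 > "$CG/cgroup.pressure"
else
    echo "cgroup.pressure not available on this system"
fi
```

Note the setting is not hierarchical: disabling PSI accounting in "mygroup" leaves its descendants' accounting untouched.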