From nobody Fri Apr 17 22:31:13 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com,
 peterz@infradead.org, tj@kernel.org, corbet@lwn.net,
 akpm@linux-foundation.org, rdunlap@infradead.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou
Subject: [PATCH 1/9] sched/psi: fix periodic aggregation shut off
Date: Thu, 21 Jul 2022 12:04:31 +0800
Message-Id: <20220721040439.2651-2-zhouchengming@bytedance.com>
In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com>
References: <20220721040439.2651-1-zhouchengming@bytedance.com>

We don't want to wake periodic aggregation work back up if the task
change is the aggregation worker itself going to sleep, or we'll
ping-pong forever.

Previously, psi_dequeue() called psi_task_change() when a task went to
sleep, so this check was put in psi_task_change().

But commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred task sleep handling to psi_task_switch(), so a sleeping task no
longer goes through psi_task_change().

So this patch moves the check to psi_task_switch(). Note that in the
deferred sleep case we should still wake the periodic avgs work for the
common ancestor groups, since the next task is scheduled in on those groups.

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/sched/psi.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..c8a4e644cd2c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -800,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	bool wake_clock = true;
 	void *iter = NULL;
 	u64 now;
 
@@ -810,19 +809,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	psi_flags_change(task, clear, set);
 
 	now = cpu_clock(cpu);
-	/*
-	 * Periodic aggregation shuts off if there is a period of no
-	 * task changes, so we wake it back up if necessary. However,
-	 * don't do this if the task change is the aggregation worker
-	 * itself going to sleep, or we'll ping-pong forever.
-	 */
-	if (unlikely((clear & TSK_RUNNING) &&
-		     (task->flags & PF_WQ_WORKER) &&
-		     wq_worker_last_func(task) == psi_avgs_work))
-		wake_clock = false;
 
 	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, wake_clock);
+		psi_group_change(group, cpu, clear, set, now, true);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -858,6 +847,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
+		bool wake_clock = true;
 
 		/*
 		 * When we're going to sleep, psi_dequeue() lets us
@@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 				clear |= TSK_MEMSTALL_RUNNING;
 			if (prev->in_iowait)
 				set |= TSK_IOWAIT;
+
+			/*
+			 * Periodic aggregation shuts off if there is a period of no
+			 * task changes, so we wake it back up if necessary. However,
+			 * don't do this if the task change is the aggregation worker
+			 * itself going to sleep, or we'll ping-pong forever.
+			 */
+			if (unlikely((prev->flags & PF_WQ_WORKER) &&
+				     wq_worker_last_func(prev) == psi_avgs_work))
+				wake_clock = false;
 		}
 
 		psi_flags_change(prev, clear, set);
 
 		iter = NULL;
 		while ((group = iterate_groups(prev, &iter)) && group != common)
-			psi_group_change(group, cpu, clear, set, now, true);
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
-- 
2.36.1

From nobody Fri Apr 17 22:31:13 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com,
 peterz@infradead.org, tj@kernel.org, corbet@lwn.net,
 akpm@linux-foundation.org, rdunlap@infradead.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou
Subject: [PATCH 2/9] sched/psi: optimize task switch inside shared cgroups again
Date: Thu, 21 Jul 2022 12:04:32 +0800
Message-Id: <20220721040439.2651-3-zhouchengming@bytedance.com>
In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com>
References: <20220721040439.2651-1-zhouchengming@bytedance.com>

Commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred prev task sleep handling to psi_task_switch(), so we don't need
to clear and set the TSK_ONCPU state for common cgroups.

    A
    |
    B
   / \
  C   D
 /     \
prev   next

After that commit psi_task_switch() does:
1. psi_group_change(next, .set=TSK_ONCPU) for D
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A

But there is a limitation: if "prev->psi_flags == next->psi_flags" is
not satisfied, this cgroup optimization is unusable for both the sleep
switch and the running switch cases. For example:

prev->in_memstall != next->in_memstall on a sleep switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A

prev->in_memstall != next->in_memstall on a running switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A

This limitation exists because we consider a group to be PSI_MEM_FULL
if the CPU is actively reclaiming and nothing productive could run even
if it were runnable. So when the CPU's curr changes from prev to next
and their in_memstall status differs, we have to change the PSI_MEM_FULL
status for their common cgroups.

This patch removes the limitation by making the PSI_MEM_FULL status
change in psi_group_change() depend on the in_memstall status of the
CPU's curr task.

Signed-off-by: Chengming Zhou
---
 kernel/sched/psi.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index c8a4e644cd2c..e04041d8251b 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * When switching between tasks that have an identical
@@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 		 * with dequeuing too, finish that for the rest of the hierarchy.
 		 */
-		if (sleep) {
+		if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 			clear &= ~TSK_ONCPU;
 			for (; group; group = iterate_groups(prev, &iter))
 				psi_group_change(group, cpu, clear, set, now, true);
-- 
2.36.1

From nobody Fri Apr 17 22:31:13 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com,
 peterz@infradead.org, tj@kernel.org, corbet@lwn.net,
 akpm@linux-foundation.org, rdunlap@infradead.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou
Subject: [PATCH 3/9] sched/psi: move private helpers to sched/stats.h
Date: Thu, 21 Jul 2022 12:04:33 +0800
Message-Id: <20220721040439.2651-4-zhouchengming@bytedance.com>
In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com>
References: <20220721040439.2651-1-zhouchengming@bytedance.com>

This patch moves the psi_task_change()/psi_task_switch() declarations
out of the PSI public header, since they are only needed for
implementing the PSI stats tracking in sched/stats.h.

The psi_task_switch() case is obvious. psi_task_change() can't be a
public helper since it doesn't check the psi_disabled static key, and it
has no other users now, so put it in sched/stats.h too.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 include/linux/psi.h  | 4 ----
 kernel/sched/stats.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 89784763d19e..aa168a038242 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
 
 void psi_init(void);
 
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
-		     bool sleep);
-
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c1ba96..c39b467ece43 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+		     bool sleep);
+
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
  * memory stalls. As a result, it has to distinguish between sleeps,
-- 
2.36.1

From nobody Fri Apr 17 22:31:13 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com,
 peterz@infradead.org, tj@kernel.org, corbet@lwn.net,
 akpm@linux-foundation.org, rdunlap@infradead.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou
Subject: [PATCH 4/9] sched/psi: don't change task psi_flags when migrate CPU/group
Date: Thu, 21 Jul 2022 12:04:34 +0800
Message-Id: <20220721040439.2651-5-zhouchengming@bytedance.com>
In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com>
References: <20220721040439.2651-1-zhouchengming@bytedance.com>

The current code uses psi_task_change() at every scheduling point, which
changes the task's psi_flags and then changes all of its psi_groups.
So we have to rely heavily on the task scheduling states to calculate
what to set and what to clear at every scheduling point, which makes
the PSI stats tracking code complex and error prone.

In fact, the task psi_flags only change at wakeup and sleep (except for
the ONCPU state at switch); they don't change at all when the task
migrates between CPUs or groups. If we keep psi_flags unchanged across
CPU/group migration, we can just use task->psi_flags to clear (migrate
out) or set (migrate in), which makes PSI stats tracking much simpler
and more efficient.

Note: ENQUEUE_WAKEUP only means waking a task from the sleep state; it
doesn't cover waking a new task, so add psi_enqueue() in
wake_up_new_task().

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.034 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.210 [sec]

       8.210600 usecs/op
         121793 ops/sec

2. after the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.032 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 8.077 [sec]

       8.077648 usecs/op
         123798 ops/sec

Signed-off-by: Chengming Zhou
---
 include/linux/sched.h |  3 ---
 kernel/sched/core.c   |  1 +
 kernel/sched/psi.c    | 24 ++++++++++---------
 kernel/sched/stats.h  | 54 +++++++++++++++++++++----------------------
 4 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..20a94786cad8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,9 +879,6 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
-#ifdef CONFIG_PSI
-	unsigned			sched_psi_wake_requeue:1;
-#endif
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a463dbc92fcd..f5f2d3542b05 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p)
 	post_init_entity_util_avg(p);
 
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	psi_enqueue(p, true);
 	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index e04041d8251b..6ba159fe2a4f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 	task->psi_flags |= set;
 }
 
-void psi_task_change(struct task_struct *task, int clear, int set)
+void psi_change_groups(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
 	void *iter = NULL;
-	u64 now;
+	u64 now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter)))
+		psi_group_change(group, cpu, clear, set, now, true);
+}
 
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
-
-	now = cpu_clock(cpu);
-
-	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, true);
+	psi_change_groups(task, clear, set);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	 *   pick_next_task()
 	 *     rq_unlock()
 	 *                                rq_lock()
-	 *                                psi_task_change() // old cgroup
+	 *                                psi_change_groups() // old cgroup
 	 *                                task->cgroups = to
-	 *                                psi_task_change() // new cgroup
+	 *                                psi_change_groups() // new cgroup
 	 *                                rq_unlock()
 	 *     rq_lock()
 	 *   psi_sched_switch() // does deferred updates in new cgroup
@@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	task_flags = task->psi_flags;
 
 	if (task_flags)
-		psi_task_change(task, task_flags, 0);
+		psi_change_groups(task, task_flags, 0);
 
 	/* See comment above */
 	rcu_assign_pointer(task->cgroups, to);
 
 	if (task_flags)
-		psi_task_change(task, 0, task_flags);
+		psi_change_groups(task, 0, task_flags);
 
 	task_rq_unlock(rq, task, &rf);
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..e930b8fa6253 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
@@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (p->in_memstall)
-		set |= TSK_MEMSTALL_RUNNING;
+	if (!wakeup) {
+		if (p->psi_flags)
+			psi_change_groups(p, 0, p->psi_flags);
+		return;
+	}
 
-	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->in_memstall)
+	/*
+	 * wakeup (including wakeup migrate) need to change task psi_flags,
+	 * specifically need to set TSK_RUNNING or TSK_MEMSTALL_RUNNING.
+	 * Since we clear task->psi_flags for wakeup migrated task, we need
+	 * to check task->psi_flags to see what should be set and clear.
+	 */
+	if (unlikely(p->in_memstall)) {
+		set |= TSK_MEMSTALL_RUNNING;
+		if (!(p->psi_flags & TSK_MEMSTALL))
 			set |= TSK_MEMSTALL;
-		if (p->sched_psi_wake_requeue)
-			p->sched_psi_wake_requeue = 0;
-	} else {
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
 	}
+	if (p->psi_flags & TSK_IOWAIT)
+		clear |= TSK_IOWAIT;
 
 	psi_task_change(p, clear, set);
 }
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING;
-
 	if (static_branch_likely(&psi_disabled))
 		return;
 
+	if (!sleep) {
+		if (p->psi_flags)
+			psi_change_groups(p, p->psi_flags, 0);
+		return;
+	}
+
 	/*
 	 * A voluntary sleep is a dequeue followed by a task switch. To
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
 	 */
-	if (sleep)
-		return;
-
-	if (p->in_memstall)
-		clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
-
-	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)
@@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
 	/*
 	 * Is the task being migrated during a wakeup? Make sure to
 	 * deregister its sleep-persistent psi states from the old
-	 * queue, and let psi_enqueue() know it has to requeue.
+	 * queue.
*/ - if (unlikely(p->in_iowait || p->in_memstall)) { + if (unlikely(p->psi_flags)) { struct rq_flags rf; struct rq *rq; - int clear =3D 0; - - if (p->in_iowait) - clear |=3D TSK_IOWAIT; - if (p->in_memstall) - clear |=3D TSK_MEMSTALL; =20 rq =3D __task_rq_lock(p, &rf); - psi_task_change(p, clear, 0); - p->sched_psi_wake_requeue =3D 1; + psi_task_change(p, p->psi_flags, 0); __task_rq_unlock(rq, &rf); } } --=20 2.36.1 From nobody Fri Apr 17 22:31:13 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BEA2FC43334 for ; Thu, 21 Jul 2022 04:05:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231788AbiGUEFt (ORCPT ); Thu, 21 Jul 2022 00:05:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44670 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231745AbiGUEFX (ORCPT ); Thu, 21 Jul 2022 00:05:23 -0400 Received: from mail-pj1-x102c.google.com (mail-pj1-x102c.google.com [IPv6:2607:f8b0:4864:20::102c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5E7D013DC1 for ; Wed, 20 Jul 2022 21:05:20 -0700 (PDT) Received: by mail-pj1-x102c.google.com with SMTP id f3-20020a17090ac28300b001f22d62bfbcso642495pjt.0 for ; Wed, 20 Jul 2022 21:05:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=iTBK+C00vRPG4JDueYyv7EqFsWvxIqsN3waJE+O/isA=; b=QHIvOf/vF7cxWQOSppu23p6vElDHp0+ZqYuQj3IE6sYMS6UGWgb8VpdRJEDv/3fmgK EBsyNAQs2eSnCL8T+CYX5Ge6nLSOCZhoT7717mCb3PYoGzDPZvxo0RttCkMybvV+RiqR 7lSJrgu7/rcXtElobuMaOqMSQk4MZHbvh0tkYM8/N8UBVTGiNPK+sk0l0Q1No0nSuCne +ICLWxXj78VcKThofLCbfIwAuz7LBngg0/wfa0zSvrleB7AUxs7BkQP92oCOqUvaFR9q 
96tnMKzLbOOur2qTlrlpvcq9bAVMyNMH61VFRD89X5LBADuZX1NXdhUgyrLX/EYl/Du0 N72g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=iTBK+C00vRPG4JDueYyv7EqFsWvxIqsN3waJE+O/isA=; b=624aFB/dSjbcIaEmJ8h8dqpr7EIZTEkmeMLT4N0PutDLNFzBJnyg6Dy5Pj1q+1Wg3F Tq+Vp0oyYuomMlxcG+LGs+0hIEa9rHGrHhqds0+Q8D1Gux5HHtb0YOGDbtq7jtynJ0EK i+KT/kxCR+GGsnnzvinvXBNww4P1idjVeky5StGUqv6aYF3RuiQihI9MszvKgTt6TU25 uRg954fFf+8oqeo3MFr2VBsEXDSrkmyp2IegSP2W/O8jkqlTqUQ4DIp3ZCTuPcLpb3mi E2reNWXwuSibWJbxQYqZFj53x2edb4bCaVIFDcL+gi0rYlHdZFADeTmMOv98SzKNoGZF WNhA== X-Gm-Message-State: AJIora9vm91d6xFDRuAuF2vMfsU5YhWCz46C/mN6HMP0jpW27zi8JstZ PEbxFoi7s3fcnsd9hjsorLZBYw== X-Google-Smtp-Source: AGRyM1tD8ydLFbzp0WwR7EaNXDsuFnSu/q0Me8fNNp8ZZt3K7hT+RDgBX5aStMP8YvI5v0abrO21vQ== X-Received: by 2002:a17:903:44b:b0:16c:b112:7f74 with SMTP id iw11-20020a170903044b00b0016cb1127f74mr35436361plb.149.1658376319603; Wed, 20 Jul 2022 21:05:19 -0700 (PDT) Received: from C02CV1DAMD6P.bytedance.net ([139.177.225.228]) by smtp.gmail.com with ESMTPSA id f4-20020a170902684400b0016bdf0032b9sm384368pln.110.2022.07.20.21.05.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Jul 2022 21:05:19 -0700 (PDT) From: Chengming Zhou To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com, peterz@infradead.org, tj@kernel.org, corbet@lwn.net, akpm@linux-foundation.org, rdunlap@infradead.org Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou Subject: [PATCH 5/9] sched/psi: don't create cgroup PSI files when psi_disabled Date: Thu, 21 Jul 2022 12:04:35 +0800 Message-Id: <20220721040439.2651-6-zhouchengming@bytedance.com> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com> References: <20220721040439.2651-1-zhouchengming@bytedance.com> MIME-Version: 
1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking
configurable") made it possible to configure PSI to skip per-cgroup stall
accounting, in which case the PSI files are not exposed in the cgroup
hierarchy.

This patch does the same thing when psi_disabled is set.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/cgroup/cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..1424da7ed2c4 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3700,6 +3700,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
 
 bool cgroup_psi_enabled(void)
 {
+	if (static_branch_likely(&psi_disabled))
+		return false;
+
 	return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
 }
 
-- 
2.36.1

From nobody Fri Apr 17 22:31:13 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BC392C433EF for ; Thu, 21 Jul 2022 04:06:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231963AbiGUEGE (ORCPT ); Thu, 21 Jul 2022 00:06:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:44400 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231810AbiGUEF1 (ORCPT ); Thu, 21 Jul 2022 00:05:27 -0400 Received: from mail-pf1-x42d.google.com (mail-pf1-x42d.google.com [IPv6:2607:f8b0:4864:20::42d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 31E5719029 for ; Wed, 20 Jul 2022 21:05:25 -0700 (PDT) Received: by mail-pf1-x42d.google.com with SMTP id g12so649305pfb.3 for ; Wed, 20 Jul 2022 21:05:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256;
c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=XAbsx4z61PDmKxVrG+hWql1tanJuLJy0FXV+SbeVNSE=; b=n3xRXJ18Eab7FaOtwX3iQ8M5Fy8jiqky4vI6yYmrue1hP+DLKahrynn7eC45oq8pwv Rzb4EDX1fZFqqRvmOKFgP/BR33FOpvxGhPiKXyDA59rd8GTuagmsSEN4m/r2JxeZdcRZ m4vrsmTlWy1bdAyu70rUvXfq/7GqBOf61jVfLd3wsd+b8A+Wfrbj18dnl8slpV268XN7 KprkkPcc+QRurpYYTV3ecrMQlGjoFWfs5Fk3SX+DRQwqW3d4CUUkHwSCwE+tuN3eXOVk NF3+BfTJt2R2yBh3NVJ3knbcNeIoqjdDdd4qcM1Q5aEnNtD/XFHRN7T6vt+6EAvJE/1H k+AQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=XAbsx4z61PDmKxVrG+hWql1tanJuLJy0FXV+SbeVNSE=; b=vho8DLw7XCzCuWT0TqQF7dOSpozXkYS1X/lfcfjVbe6+SH3AclEbFj1nat8MhaXp1M Kxs3cemQ51cTPT2+JQECn8OQSVpIqMUoi/NeO14SFoG9Q3t8eMyo/sWZajHK6nENimq6 LCXYlFncYpTehzjjDj0jTB4HbMZe7meNMYNyDndSyA0Pu+p/8meyf/vbGqQAbDXuxw41 k+2YvcsJvJj6J8lfZThXVxW6xqWnS7XFC85RK0/ZSELuDZUI4Rb8ppg5SRPKeH9rmhGy x1QOB88VdC6Lbhv/lqhV5qUuxIu6whAB6nin9XfCPgZ8b1Q4Z3oNAtblsdEr6LFj5eh4 zqXw== X-Gm-Message-State: AJIora9FSyVqEPfuLT898ITq7RycVLdlaGo670i5AGYcU4VR/UvFE/0C sklyht685FGOVjouuDfT6fXVvg== X-Google-Smtp-Source: AGRyM1t1b8tjCTGsXmfFwxPrKRYVz+vLBtMewzagJZyz+HIiUE8b0Jq9QUpdHQd4dl7IWbKIm0soVw== X-Received: by 2002:a65:6e41:0:b0:412:4c1f:9936 with SMTP id be1-20020a656e41000000b004124c1f9936mr37332444pgb.455.1658376324638; Wed, 20 Jul 2022 21:05:24 -0700 (PDT) Received: from C02CV1DAMD6P.bytedance.net ([139.177.225.228]) by smtp.gmail.com with ESMTPSA id f4-20020a170902684400b0016bdf0032b9sm384368pln.110.2022.07.20.21.05.19 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Jul 2022 21:05:24 -0700 (PDT) From: Chengming Zhou To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com, peterz@infradead.org, tj@kernel.org, corbet@lwn.net, akpm@linux-foundation.org, 
rdunlap@infradead.org Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou Subject: [PATCH 6/9] sched/psi: save percpu memory when !psi_cgroups_enabled Date: Thu, 21 Jul 2022 12:04:36 +0800 Message-Id: <20220721040439.2651-7-zhouchengming@bytedance.com> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com> References: <20220721040439.2651-1-zhouchengming@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

We won't use the cgroup psi_group when !psi_cgroups_enabled, so don't
bother allocating and initializing percpu memory for it.

There is also no need to migrate task PSI stats between cgroups in
cgroup_move_task().

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 6ba159fe2a4f..aa40bf888102 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -205,6 +205,7 @@ void __init psi_init(void)
 {
 	if (!psi_enable) {
 		static_branch_enable(&psi_disabled);
+		static_branch_disable(&psi_cgroups_enabled);
 		return;
 	}
 
@@ -952,7 +953,7 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return 0;
 
 	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
@@ -964,7 +965,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 
 void psi_cgroup_free(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return;
 
 	cancel_delayed_work_sync(&cgroup->psi.avgs_work);
@@ -991,7 +992,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (static_branch_likely(&psi_disabled)) {
+	if (!static_branch_likely(&psi_cgroups_enabled)) {
 		/*
 		 * Lame to do this here, but the scheduler cannot be locked
 		 * from the outside, so we move cgroups from inside sched/.
-- 
2.36.1

From nobody Fri Apr 17 22:31:13 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C6F32C43334 for ; Thu, 21 Jul 2022 04:06:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231849AbiGUEGG (ORCPT ); Thu, 21 Jul 2022 00:06:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45082 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231689AbiGUEFb (ORCPT ); Thu, 21 Jul 2022 00:05:31 -0400 Received: from mail-pf1-x42d.google.com (mail-pf1-x42d.google.com [IPv6:2607:f8b0:4864:20::42d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 472F6267 for ; Wed, 20 Jul 2022 21:05:30 -0700 (PDT) Received: by mail-pf1-x42d.google.com with SMTP id g12so649453pfb.3 for ; Wed, 20 Jul 2022 21:05:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=nv+1NkEw7UxRZmLxeUo4cV9wyQ4QwqO39uBVndf0jCc=; b=bjY6/TWcojIR7yKS1J162GCjohPw3F5UgNwyHYopIXrwMDDhzYOMpRFhjZVlve2jnH kwdjjXY0qDCbrgsz9WqbVpRuWZ/Nk7Cif76tkZSmkDPqpohm0J91WwUChBE04lhb8UU6 bMa1rinjFXXBuHRFDZOwH0rNyd98UByXccx+txf36+XCjX5EV3NqhMqC3PlNEnfNZQy4 sHCEu0mXQGJwGoB3sYRlhw6YgYO8xUYWKpfM52re2Nmhn49YQgpmwu1sxezhoEs33P10 4N27susZxutNT6NYKpZM7N3Cvs6ft7Km3OvrjajXEjV79pIlZYZeeEzOl/rfVJwGi1Mp 499g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
:references:mime-version:content-transfer-encoding; bh=nv+1NkEw7UxRZmLxeUo4cV9wyQ4QwqO39uBVndf0jCc=; b=gIX0Rkr6bqpybyKG5l1vJMUuqfYaTJHLZBl7l1IPNABXuFN0/PaSJ29aydSqDeyAu1 ynUFIsNw+SgZbeGzvqohvHiprHKFVgVd41jcKjvgEQcLDdQNfmoIA213sH+6F1rTsoyE 0agTc3hDwpAg885yvyFg4rdNd4Bu9oA43mAnze5TN1KABWLls02IVwOvBjOQyT07+qcl +LCYVR/L9Eb5BXu1V6mXJQzERqK8CEoVGz9kjAGThtixAzFQUyx9HluNJhPg3ISj1jj1 6KLCgL6mSp73U51wgNbLefNNV6+Eegp/W9Fbj3BAjIbGY0eajyKIZxiz9Xa5LURznA6y 3tew== X-Gm-Message-State: AJIora+qzOxma8SXwn5s9Ac3orNM+RFeO9AcBbFofCGtCGSnTF+cTCBI lXZ66F5UjbtctDG8eNnekNUtlA== X-Google-Smtp-Source: AGRyM1sKSRpfwuk39oniqXz1NHdVDkTE3JxCDh4X7ic+6gMr/6ie6K4T8h8l/Eo1heVC6xfCAUhd5A== X-Received: by 2002:a62:7b57:0:b0:52a:bb3b:21fb with SMTP id w84-20020a627b57000000b0052abb3b21fbmr42255417pfc.21.1658376329881; Wed, 20 Jul 2022 21:05:29 -0700 (PDT) Received: from C02CV1DAMD6P.bytedance.net ([139.177.225.228]) by smtp.gmail.com with ESMTPSA id f4-20020a170902684400b0016bdf0032b9sm384368pln.110.2022.07.20.21.05.25 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Jul 2022 21:05:29 -0700 (PDT) From: Chengming Zhou To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com, peterz@infradead.org, tj@kernel.org, corbet@lwn.net, akpm@linux-foundation.org, rdunlap@infradead.org Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou Subject: [PATCH 7/9] sched/psi: cache parent psi_group to speed up groups iterate Date: Thu, 21 Jul 2022 12:04:37 +0800 Message-Id: <20220721040439.2651-8-zhouchengming@bytedance.com> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com> References: <20220721040439.2651-1-zhouchengming@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" We use iterate_groups() to iterate each level psi_group to update PSI 
stats, which is a very hot path. In the current code, iterate_groups() has
to use multiple branches and cgroup_parent() to get the parent psi_group at
each level, which is not very efficient.

This patch caches the parent psi_group, so we only need to get the
psi_group of the task itself first and can then follow group->parent to
iterate.

This patch also prepares for the following patch, in which PSI can be
configured to account only for leaf cgroups and system-wide.

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. before the patch:

$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.032 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 8.077 [sec]

       8.077648 usecs/op
         123798 ops/sec

2. after the patch:

$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.032 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 7.758 [sec]

       7.758354 usecs/op
         128893 ops/sec

Signed-off-by: Chengming Zhou
---
 include/linux/psi_types.h |  2 ++
 kernel/sched/psi.c        | 48 ++++++++++++++++++++-------------------
 2 files changed, 27 insertions(+), 23 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..c124f7d186d0 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -147,6 +147,8 @@ struct psi_trigger {
 };
 
 struct psi_group {
+	struct psi_group *parent;
+
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index aa40bf888102..2228cbf3bdd3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -758,30 +758,22 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		schedule_delayed_work(&group->avgs_work, PSI_FREQ);
 }
 
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
 {
-	if (*iter == &psi_system)
-		return NULL;
-
 #ifdef CONFIG_CGROUPS
 	if (static_branch_likely(&psi_cgroups_enabled)) {
-		struct cgroup *cgroup = NULL;
-
-		if (!*iter)
-			cgroup = task->cgroups->dfl_cgrp;
-		else
-			cgroup = cgroup_parent(*iter);
+		struct cgroup *cgroup = task_dfl_cgroup(task);
 
-		if (cgroup && cgroup_parent(cgroup)) {
-			*iter = cgroup;
+		if (cgroup && cgroup_parent(cgroup))
 			return cgroup_psi(cgroup);
-		}
 	}
 #endif
-	*iter = &psi_system;
 	return &psi_system;
 }
 
+#define for_each_psi_group(group) \
+	for (; group; group = group->parent)
+
 static void psi_flags_change(struct task_struct *task, int clear, int set)
 {
 	if (((task->psi_flags & set) ||
@@ -799,12 +791,11 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 
 void psi_change_groups(struct task_struct *task, int clear, int set)
 {
+	struct psi_group *group = task_psi_group(task);
 	int cpu = task_cpu(task);
-	struct psi_group *group;
-	void *iter = NULL;
 	u64 now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter)))
+	for_each_psi_group(group)
 		psi_group_change(group, cpu, clear, set, now, true);
 }
 
@@ -822,7 +813,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
 	struct psi_group *group, *common = NULL;
 	int cpu = task_cpu(prev);
-	void *iter;
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
@@ -833,8 +823,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		iter = NULL;
-		while ((group = iterate_groups(next, &iter))) {
+		group = task_psi_group(next);
+		for_each_psi_group(group) {
 			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
@@ -874,9 +864,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 		psi_flags_change(prev, clear, set);
 
-		iter = NULL;
-		while ((group = iterate_groups(prev, &iter)) && group != common)
+		group = task_psi_group(prev);
+		for_each_psi_group(group) {
+			if (group == common)
+				break;
 			psi_group_change(group, cpu, clear, set, now, wake_clock);
+		}
 
 		/*
 		 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
If we're tasked @@ -884,7 +877,8 @@ void psi_task_switch(struct task_struct *prev, struct t= ask_struct *next, */ if (sleep || unlikely(prev->in_memstall !=3D next->in_memstall)) { clear &=3D ~TSK_ONCPU; - for (; group; group =3D iterate_groups(prev, &iter)) + + for_each_psi_group(group) psi_group_change(group, cpu, clear, set, now, true); } } @@ -953,6 +947,8 @@ void psi_memstall_leave(unsigned long *flags) #ifdef CONFIG_CGROUPS int psi_cgroup_alloc(struct cgroup *cgroup) { + struct cgroup *parent; + if (!static_branch_likely(&psi_cgroups_enabled)) return 0; =20 @@ -960,6 +956,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup) if (!cgroup->psi.pcpu) return -ENOMEM; group_init(&cgroup->psi); + + parent =3D cgroup_parent(cgroup); + if (parent && cgroup_parent(parent)) + cgroup->psi.parent =3D cgroup_psi(parent); + else + cgroup->psi.parent =3D &psi_system; return 0; } =20 --=20 2.36.1 From nobody Fri Apr 17 22:31:13 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CE158C43334 for ; Thu, 21 Jul 2022 04:06:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230398AbiGUEGQ (ORCPT ); Thu, 21 Jul 2022 00:06:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45342 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231857AbiGUEFk (ORCPT ); Thu, 21 Jul 2022 00:05:40 -0400 Received: from mail-pl1-x62a.google.com (mail-pl1-x62a.google.com [IPv6:2607:f8b0:4864:20::62a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A983D26AE4 for ; Wed, 20 Jul 2022 21:05:35 -0700 (PDT) Received: by mail-pl1-x62a.google.com with SMTP id z3so689474plb.1 for ; Wed, 20 Jul 2022 21:05:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; 
h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=chxgxgcpZm63zsd7qnwFUE6j0UUhwVxho98X19IjMdE=; b=MEqNBgP7agly9aS2Azra3w7VOxdjKrzITB65iYsv3NXbBBWm7HIegWxDU49K4BpbH1 upu756UW+/g2v32lOG9VTubX3HCXZh4vdPmCLXyUcqTpp7LbUW3NBWRJB6sZ1dEz+9ax jLNPgIjfXX6x9qBkQEK+58do+zhoAlNUgAX7FEaaWcVrKLvW/1xA/CB4mwIZAR9wDsMm afAYO+RCU73xoMk/QP/DkmQSf3RstJQzsQzDCn07vnpAKmP8Y9Jyd9PXnblNfbBeLgiE ZFyMRwUN+WkXIe4M195SuIyrHTNTA04IEImzh2fl5Myw1YCVj/LGvPfEEQsF2ARkZEwa hbYA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=chxgxgcpZm63zsd7qnwFUE6j0UUhwVxho98X19IjMdE=; b=WiehgBJWk4AwJ2opdfGlVwrLt6atqe4DXddxw/rm2GHZ0ioH9mwoRNXDwCHG/cSMD6 9Pz3i95Gqu6n08WR8iIt1wv9zqj7H6YmZdn5rfllA+wwDg/QRKHyT7dkar/pn94HtXtm pCwrW+1WQZN5PUwceEwcQ/DsSL/hMGUdtD4zg1TuqzO/Yyn2eTV+j8FMlPWQjjjpK0cW qi0GA1pIp3AJfvOrWblAULfwXnxkK1suZ8MgAcInc/4Ax+Pz/QBWsns7ID1q45tFpC5z tPFjA6jGTHhtxGKcgR8wW8Dg8cnol7/0vIZ25dMXgohx5LlkB6VgW8t2PVw8V6DSIybV gd/Q== X-Gm-Message-State: AJIora+KH2BBtlv5EmuHlZy4xHAZpGy6wEFAXsbFBQYrYYDlVYpPHSfm B4bO5z3Gz2EW4ShjJ5fPj27YMg== X-Google-Smtp-Source: AGRyM1vUIz6lPEf3HOHUkuvCGIIMxbk0t9sGoJJTfg14bbj82E++1uIgwhKvaCZx0C4rl6QULiYhTg== X-Received: by 2002:a17:902:d552:b0:16d:33d2:955b with SMTP id z18-20020a170902d55200b0016d33d2955bmr177647plf.29.1658376335154; Wed, 20 Jul 2022 21:05:35 -0700 (PDT) Received: from C02CV1DAMD6P.bytedance.net ([139.177.225.228]) by smtp.gmail.com with ESMTPSA id f4-20020a170902684400b0016bdf0032b9sm384368pln.110.2022.07.20.21.05.30 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Jul 2022 21:05:34 -0700 (PDT) From: Chengming Zhou To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com, peterz@infradead.org, tj@kernel.org, corbet@lwn.net, akpm@linux-foundation.org, rdunlap@infradead.org Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, 
songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou Subject: [PATCH 8/9] sched/psi: add kernel cmdline parameter psi_inner_cgroup Date: Thu, 21 Jul 2022 12:04:38 +0800 Message-Id: <20220721040439.2651-9-zhouchengming@bytedance.com> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com> References: <20220721040439.2651-1-zhouchengming@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

PSI accounts stalls for each cgroup separately and aggregates them at each
level of the hierarchy. This may cause non-negligible overhead for some
workloads that run deep in the hierarchy.

Commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking
configurable") lets PSI skip per-cgroup stall accounting and account only
system-wide, which avoids this per-level overhead.

For our use case, we also want PSI accounted for the leaf cgroup, so that
userspace can make adjustments on that cgroup, in addition to system-wide
management.

So this patch adds the kernel cmdline parameter "psi_inner_cgroup" to
control whether or not to account for inner cgroups. It defaults to true
for compatibility.

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. default (psi_inner_cgroup=true)

$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.032 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 7.758 [sec]

       7.758354 usecs/op
         128893 ops/sec

2. psi_inner_cgroup=false

$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.032 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 7.309 [sec]

       7.309436 usecs/op
         136809 ops/sec

Signed-off-by: Chengming Zhou
---
 Documentation/admin-guide/kernel-parameters.txt |  6 ++++++
 kernel/sched/psi.c                              | 11 ++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8090130b544b..6beef5b8bc36 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4419,6 +4419,12 @@
 			tracking.
 			Format:
 
+	psi_inner_cgroup=
+			[KNL] Enable or disable pressure stall information
+			tracking for the inner cgroups.
+			Format:
+			default: enabled
+
 	psmouse.proto=	[HW,MOUSE] Highest PS2 mouse protocol extension to
 			probe for; one of (bare|imps|exps|lifebook|any).
 	psmouse.rate=	[HW,MOUSE] Set desired mouse report rate, in reports
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 2228cbf3bdd3..8d76920f47b3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -147,12 +147,21 @@ static bool psi_enable;
 #else
 static bool psi_enable = true;
 #endif
+
+static bool psi_inner_cgroup __read_mostly = true;
+
 static int __init setup_psi(char *str)
 {
 	return kstrtobool(str, &psi_enable) == 0;
 }
 __setup("psi=", setup_psi);
 
+static int __init setup_psi_inner_cgroup(char *str)
+{
+	return kstrtobool(str, &psi_inner_cgroup) == 0;
+}
+__setup("psi_inner_cgroup=", setup_psi_inner_cgroup);
+
 /* Running averages - we need to be higher-res than loadavg */
 #define PSI_FREQ	(2*HZ+1)	/* 2 sec intervals */
 #define EXP_10s		1677		/* 1/exp(2s/10s) as fixed-point */
@@ -958,7 +967,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 	group_init(&cgroup->psi);
 
 	parent = cgroup_parent(cgroup);
-	if (parent && cgroup_parent(parent))
+	if (parent && cgroup_parent(parent) && psi_inner_cgroup)
 		cgroup->psi.parent = cgroup_psi(parent);
 	else
 		cgroup->psi.parent = &psi_system;
-- 
2.36.1 From nobody Fri Apr 17 22:31:13 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 234A7C43334 for ; Thu, 21 Jul 2022 04:06:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232008AbiGUEGc (ORCPT ); Thu, 21 Jul 2022 00:06:32 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45544 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231920AbiGUEFt (ORCPT ); Thu, 21 Jul 2022 00:05:49 -0400 Received: from mail-pj1-x102b.google.com (mail-pj1-x102b.google.com [IPv6:2607:f8b0:4864:20::102b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 01F78313A5 for ; Wed, 20 Jul 2022 21:05:40 -0700 (PDT) Received: by mail-pj1-x102b.google.com with SMTP id p6-20020a17090a680600b001f2267a1c84so2321568pjj.5 for ; Wed, 20 Jul 2022 21:05:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=IajutrYR1gNldwnX+rwhURWEvmgjhBqy7g0eKQMrBVs=; b=UnxLTwZRABkrK5soWvqW5KuugHBHwTDqjNtjq9rDE9oDCJT2dmpwUaBzxANv/yqa3W Vw665jPOgVDi3e2/19NQL1VJrgoHHZN7tq9+MO15GHDsqb1kMKFZbLYYGpHcmelECSyt jaKalYKYWN0QBeF8mpL99kV419pD4QLzFhdQPvoxMeuQBz2YKEcVHFOhd+ijmnSHCU4j LD+jT1Ib06Vdb1zsIsHuX8xDzijvEtrrTyqYnjz56Ryk2bSBRiMLQDPhe2ZFNuNZz2Z4 EP7S2RTJML8twwVepBbW6lg1kqWMwfBBuKPzGDtxaA9N91e5hTulGJPBOCzT2zVd8ZFN v0dQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=IajutrYR1gNldwnX+rwhURWEvmgjhBqy7g0eKQMrBVs=; b=hQMD2kfkYuaS8MGQgW42rR7xNeLU7uKZ6krGaCostu5wtmu6QAx+tB7le6nPbzRRSb 
vP56Tk6Ja0cZMANGsSGAjwRHqth0bH8az9TIr7n6pWxRz5dsHA71aPSfbN9Zsr2kU4MY A/znW2YKA0BsWJiWurd62iSMucQ9nTYjqcbUoLL3NOYEs0LFUVwOBj+xbCR7lTZFJhlG joBEI+BIMORRTyKlZ+Ext3gV5EFegLefq/d4GKH0JmAcVQyI0k8IC6NCGCpjj9K89hXX EiMscSem9zHJv/jEcaW0jBHOsgMgX8XxmWW01vfbIycA7CSQUwPw3tdar+tHElX48deb WoFA== X-Gm-Message-State: AJIora/neQUzVs7QQvMHu0Hpw0oPXjmuqc5yK8Cb24suz5+YtsZ71970 KjSf2eQDhPMfZeBuHdbuFOzvOQ== X-Google-Smtp-Source: AGRyM1uPJwgj7AqMyj+5vZko6fV+c6MYVQ1nKW7O6lLwEGglY6UbZr1YquCVRjcgQcRntK0ORrDZkg== X-Received: by 2002:a17:90a:9488:b0:1f2:2768:facf with SMTP id s8-20020a17090a948800b001f22768facfmr4044039pjo.38.1658376340385; Wed, 20 Jul 2022 21:05:40 -0700 (PDT) Received: from C02CV1DAMD6P.bytedance.net ([139.177.225.228]) by smtp.gmail.com with ESMTPSA id f4-20020a170902684400b0016bdf0032b9sm384368pln.110.2022.07.20.21.05.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 20 Jul 2022 21:05:40 -0700 (PDT) From: Chengming Zhou To: hannes@cmpxchg.org, surenb@google.com, mingo@redhat.com, peterz@infradead.org, tj@kernel.org, corbet@lwn.net, akpm@linux-foundation.org, rdunlap@infradead.org Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, Chengming Zhou Subject: [PATCH 9/9] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure Date: Thu, 21 Jul 2022 12:04:39 +0800 Message-Id: <20220721040439.2651-10-zhouchengming@bytedance.com> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20220721040439.2651-1-zhouchengming@bytedance.com> References: <20220721040439.2651-1-zhouchengming@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8"

PSI already tracks workload pressure stall information for CPU, memory
and IO. Beyond these, IRQ/SOFTIRQ time can have an obvious impact on the
productivity of some workloads, such as web services.

When CONFIG_IRQ_TIME_ACCOUNTING is enabled, we can get the IRQ/SOFTIRQ
delta time from update_rq_clock_task(), where we record that delta against
the cgroups of the CPU's current task as PSI_IRQ_FULL time.

Note that we don't use PSI_IRQ_SOME: IRQ/SOFTIRQ always runs in the
context of the current task on the CPU, so nothing productive could run
even if it were runnable. We therefore only use PSI_IRQ_FULL.

With the performance impact in mind, this is enabled by default when
CONFIG_IRQ_TIME_ACCOUNTING is set, but it can be disabled with the kernel
cmdline parameter "psi_irq=".

Signed-off-by: Chengming Zhou
Reported-by: kernel test robot
Reviewed-by: Chengming Zhou
Tested-by: Chengming Zhou
---
 .../admin-guide/kernel-parameters.txt |  5 ++
 include/linux/psi.h                   |  1 +
 include/linux/psi_types.h             |  7 +-
 kernel/cgroup/cgroup.c                | 27 +++++++
 kernel/sched/core.c                   |  1 +
 kernel/sched/psi.c                    | 76 ++++++++++++++++++-
 kernel/sched/stats.h                  | 13 ++++
 7 files changed, 126 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6beef5b8bc36..1067dde299a0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4425,6 +4425,11 @@
 			Format:
 			default: enabled
 
+	psi_irq=	[KNL] Enable or disable IRQ/SOFTIRQ pressure stall
+			information tracking.
+			Format:
+			default: enabled when CONFIG_IRQ_TIME_ACCOUNTING.
+
 	psmouse.proto=	[HW,MOUSE] Highest PS2 mouse protocol extension to
 			probe for; one of (bare|imps|exps|lifebook|any).
psmouse.rate=3D [HW,MOUSE] Set desired mouse report rate, in reports diff --git a/include/linux/psi.h b/include/linux/psi.h index aa168a038242..f5cf3e45d5a5 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -14,6 +14,7 @@ struct css_set; #ifdef CONFIG_PSI =20 extern struct static_key_false psi_disabled; +extern struct static_key_true psi_irq_enabled; extern struct psi_group psi_system; =20 void psi_init(void); diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h index c124f7d186d0..195f123b1cd1 100644 --- a/include/linux/psi_types.h +++ b/include/linux/psi_types.h @@ -47,7 +47,8 @@ enum psi_res { PSI_IO, PSI_MEM, PSI_CPU, - NR_PSI_RESOURCES =3D 3, + PSI_IRQ, + NR_PSI_RESOURCES =3D 4, }; =20 /* @@ -63,9 +64,11 @@ enum psi_states { PSI_MEM_FULL, PSI_CPU_SOME, PSI_CPU_FULL, + PSI_IRQ_SOME, + PSI_IRQ_FULL, /* Only per-CPU, to weigh the CPU in the global average: */ PSI_NONIDLE, - NR_PSI_STATES =3D 7, + NR_PSI_STATES =3D 9, }; =20 enum psi_aggregators { diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 1424da7ed2c4..cf61df0ac892 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3683,6 +3683,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kern= fs_open_file *of, return cgroup_pressure_write(of, buf, nbytes, PSI_CPU); } =20 +#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static int cgroup_irq_pressure_show(struct seq_file *seq, void *v) +{ + struct cgroup *cgrp =3D seq_css(seq)->cgroup; + struct psi_group *psi =3D cgroup_ino(cgrp) =3D=3D 1 ? 
&psi_system : &cgrp= ->psi; + + return psi_show(seq, psi, PSI_IRQ); +} + +static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ); +} +#endif + static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of, poll_table *pt) { @@ -5079,6 +5096,16 @@ static struct cftype cgroup_base_files[] =3D { .poll =3D cgroup_pressure_poll, .release =3D cgroup_pressure_release, }, +#ifdef CONFIG_IRQ_TIME_ACCOUNTING + { + .name =3D "irq.pressure", + .flags =3D CFTYPE_PRESSURE, + .seq_show =3D cgroup_irq_pressure_show, + .write =3D cgroup_irq_pressure_write, + .poll =3D cgroup_pressure_poll, + .release =3D cgroup_pressure_release, + }, +#endif #endif /* CONFIG_PSI */ { } /* terminate */ }; diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f5f2d3542b05..08637cfb7ed9 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 del= ta) =20 rq->prev_irq_time +=3D irq_delta; delta -=3D irq_delta; + psi_account_irqtime(rq->curr, irq_delta); #endif #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING if (static_key_false((¶virt_steal_rq_enabled))) { diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 8d76920f47b3..6a0894e28780 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -141,6 +141,7 @@ static int psi_bug __read_mostly; =20 DEFINE_STATIC_KEY_FALSE(psi_disabled); DEFINE_STATIC_KEY_TRUE(psi_cgroups_enabled); +DEFINE_STATIC_KEY_TRUE(psi_irq_enabled); =20 #ifdef CONFIG_PSI_DEFAULT_DISABLED static bool psi_enable; @@ -150,6 +151,12 @@ static bool psi_enable =3D true; =20 static bool psi_inner_cgroup __read_mostly =3D true; =20 +#ifdef CONFIG_IRQ_TIME_ACCOUNTING +static bool psi_irq_enable =3D true; +#else +static bool psi_irq_enable; +#endif + static int __init setup_psi(char *str) { return kstrtobool(str, &psi_enable) =3D=3D 0; @@ -162,6 +169,12 @@ static int __init 
setup_psi_inner_cgroup(char *str) } __setup("psi_inner_cgroup=3D", setup_psi_inner_cgroup); =20 +static int __init setup_psi_irq(char *str) +{ + return kstrtobool(str, &psi_irq_enable) =3D=3D 0; +} +__setup("psi_irq=3D", setup_psi_irq); + /* Running averages - we need to be higher-res than loadavg */ #define PSI_FREQ (2*HZ+1) /* 2 sec intervals */ #define EXP_10s 1677 /* 1/exp(2s/10s) as fixed-point */ @@ -215,12 +228,16 @@ void __init psi_init(void) if (!psi_enable) { static_branch_enable(&psi_disabled); static_branch_disable(&psi_cgroups_enabled); + static_branch_disable(&psi_irq_enabled); return; } =20 if (!cgroup_psi_enabled()) static_branch_disable(&psi_cgroups_enabled); =20 + if (!psi_irq_enable) + static_branch_disable(&psi_irq_enabled); + psi_period =3D jiffies_to_nsecs(PSI_FREQ); group_init(&psi_system); } @@ -893,6 +910,28 @@ void psi_task_switch(struct task_struct *prev, struct = task_struct *next, } } =20 +void psi_groups_account_irqtime(struct task_struct *task, u32 delta) +{ + struct psi_group *group =3D task_psi_group(task); + int cpu =3D task_cpu(task); + u64 now =3D cpu_clock(cpu); + struct psi_group_cpu *groupc; + + for_each_psi_group(group) { + groupc =3D per_cpu_ptr(group->pcpu, cpu); + + write_seqcount_begin(&groupc->seq); + + record_times(groupc, now); + groupc->times[PSI_IRQ_FULL] +=3D delta; + + write_seqcount_end(&groupc->seq); + + if (group->poll_states & (1 << PSI_IRQ_FULL)) + psi_schedule_poll_work(group, 1); + } +} + /** * psi_memstall_enter - mark the beginning of a memory stall section * @flags: flags to handle nested sections @@ -1069,7 +1108,7 @@ int psi_show(struct seq_file *m, struct psi_group *gr= oup, enum psi_res res) group->avg_next_update =3D update_averages(group, now); mutex_unlock(&group->avgs_lock); =20 - for (full =3D 0; full < 2; full++) { + for (full =3D (res =3D=3D PSI_IRQ); full < 2; full++) { unsigned long avg[3] =3D { 0, }; u64 total =3D 0; int w; @@ -1111,9 +1150,12 @@ struct psi_trigger *psi_trigger_create(struct 
psi_group *group,
 	else
 		return ERR_PTR(-EINVAL);
 
-	if (state >= PSI_NONIDLE)
+	if (state >= PSI_NONIDLE || state == PSI_IRQ_SOME)
 		return ERR_PTR(-EINVAL);
 
+	if (!static_branch_likely(&psi_irq_enabled) && state == PSI_IRQ_FULL)
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if (window_us < WINDOW_MIN_US ||
 		window_us > WINDOW_MAX_US)
 		return ERR_PTR(-EINVAL);
@@ -1395,6 +1437,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
 	.proc_release	= psi_fop_release,
 };
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+	return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+			     size_t nbytes, loff_t *ppos)
+{
+	return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+	.proc_open	= psi_irq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_irq_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
+};
+#endif
+
 static int __init psi_proc_init(void)
 {
 	if (psi_enable) {
@@ -1402,6 +1471,9 @@ static int __init psi_proc_init(void)
 		proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
 		proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
 		proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+		proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
 	}
 	return 0;
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index e930b8fa6253..10926cdaccc8 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,6 +111,7 @@ void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
+void
psi_groups_account_irqtime(struct task_struct *task, u32 delta);
 
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -196,6 +197,17 @@ static inline void psi_sched_switch(struct task_struct *prev,
 	psi_task_switch(prev, next, sleep);
 }
 
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+	if (!static_branch_likely(&psi_irq_enabled))
+		return;
+
+	if (!task->pid)
+		return;
+
+	psi_groups_account_irqtime(task, delta);
+}
+
 #else /* CONFIG_PSI */
 static inline void psi_enqueue(struct task_struct *p, bool wakeup) {}
 static inline void psi_dequeue(struct task_struct *p, bool sleep) {}
@@ -203,6 +215,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *curr, u32 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.36.1