From: K Prateek Nayak
To: Chengming Zhou, Johannes Weiner, Suren Baghdasaryan, Ingo Molnar,
    Peter Zijlstra, Juri Lelli, Vincent Guittot
CC: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Valentin Schneider, "Gautham R. Shenoy", K Prateek Nayak
Subject: [PATCH v2] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Date: Fri, 27 Dec 2024 06:19:41 +0000
Message-ID: <20241227061941.2315-1-kprateek.nayak@amd.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: Chengming Zhou

When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:

    psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat,
the following sequence was observed:

    [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
        ...
    [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
    [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8
    # CPU8 goes into newidle balance and releases the rq lock
        ...
    # CPU15 on same LLC Domain is trying to wake up hackbench(pid=1831)
    [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
    [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
    [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity. However, with the introduction of DELAY_DEQUEUE,
the blocked task can wake up when newidle balance drops the runqueue
lock during __schedule().

If a task wakes before psi_sched_switch() adjusts the PSI flags, skip
any modifications in psi_enqueue(), which would still see the flags of a
running task and not a blocked one. Instead, rely on psi_sched_switch()
to do the right thing.

Since the status returned by try_to_block_task() may no longer be true
by the time __schedule() reaches psi_sched_switch(), check whether the
task is blocked using a combination of task_on_rq_queued() and
p->se.sched_delayed checks.
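
For readers following the reasoning above, here is a minimal userspace
sketch of the two decisions the patch makes. It is not kernel code: the
struct and every model_* name are hypothetical stand-ins for
task_on_rq_queued(), p->se.sched_delayed and task_on_cpu(), and the
consistency test is an assumed simplification of the check that prints
the splat.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace stand-in for the scheduler state the patch reads.
 * None of these names exist in the kernel.
 */
struct model_task {
	bool on_rq_queued;	/* models task_on_rq_queued(p) */
	bool sched_delayed;	/* models p->se.sched_delayed */
	bool on_cpu;		/* models task_on_cpu(task_rq(p), p) */
};

/* Models the "blocked" argument computed for psi_sched_switch(). */
static bool model_prev_blocked(const struct model_task *prev)
{
	return !prev->on_rq_queued || prev->sched_delayed;
}

/* Models the early bailout in psi_enqueue(): true means "leave flags alone". */
static bool model_enqueue_skips_psi(const struct model_task *p)
{
	return p->on_cpu;
}

/*
 * Assumed shape of the consistency test behind the splat: setting a bit
 * that is already set, or clearing one that is not set, is inconsistent.
 */
static bool model_flags_inconsistent(unsigned int flags, unsigned int clear,
				     unsigned int set)
{
	return (flags & set) != 0 || (flags & clear) != clear;
}

/* Walks the states from the trace and checks each decision. */
static void model_selfcheck(void)
{
	/* prev blocked via delayed dequeue, still current on its CPU */
	struct model_task t = {
		.on_rq_queued = true, .sched_delayed = true, .on_cpu = true,
	};
	assert(model_prev_blocked(&t));

	/* A remote wakeup requeues it while the rq lock is dropped. */
	t.sched_delayed = false;
	assert(model_enqueue_skips_psi(&t));	/* psi_enqueue() bails out */
	assert(!model_prev_blocked(&t));	/* switch sees a runnable prev */

	/* The splat's values trip the test read as hex or as decimal. */
	assert(model_flags_inconsistent(0x14, 0x0, 0x4));
	assert(model_flags_inconsistent(14, 0, 4));
}
```

Calling model_selfcheck() replays the trace: before the wakeup the task
counts as blocked, and after the requeue the enqueue path leaves the
flags alone while the switch path sees a runnable prev.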

[ prateek: Commit message, testing, early bailout in psi_enqueue() ]

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") # 1a6151017ee5
Link: https://lore.kernel.org/all/409b4a72-483e-467b-8d00-9a8dae48bdc9@linux.dev/
Signed-off-by: Chengming Zhou
Signed-off-by: K Prateek Nayak
Reviewed-by: Chengming Zhou
---
v1..v2:

o Removed any considerations of psi_ttwu_dequeue() racing with
  psi_sched_switch() and used the solution from Chengming to only
  consider a requeue of a delayed task.

o Reworded the commit message to only highlight the relevant bits and
  corrected the Fixes tag.

Thank you Chengming for patiently explaining all the nuances that led to
the splat :)

This patch is based on tip:sched/core at commit af98d8a36a96
("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")

Reproducer for the PSI splat:

  mkdir /sys/fs/cgroup/test
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  # Ridiculous limit on SMP to throttle multiple rqs at once
  echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
  perf bench sched messaging -t -p -l 100000 -g 16

This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T) but
also on a 32 vCPU VM.
---
 kernel/sched/core.c  | 6 +++---
 kernel/sched/stats.h | 4 ++++
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..3d2ab0ad80c9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6643,7 +6643,6 @@ static void __sched notrace __schedule(int sched_mode)
 	 * as a preemption by schedule_debug() and RCU.
 	 */
 	bool preempt = sched_mode > SM_NONE;
-	bool block = false;
 	unsigned long *switch_count;
 	unsigned long prev_state;
 	struct rq_flags rf;
@@ -6704,7 +6703,7 @@ static void __sched notrace __schedule(int sched_mode)
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
-		block = try_to_block_task(rq, prev, prev_state);
+		try_to_block_task(rq, prev, prev_state);
 		switch_count = &prev->nvcsw;
 	}
 
@@ -6750,7 +6749,8 @@ static void __sched notrace __schedule(int sched_mode)
 
 		migrate_disable_switch(rq, prev);
 		psi_account_irqtime(rq, prev, next);
-		psi_sched_switch(prev, next, block);
+		psi_sched_switch(prev, next, !task_on_rq_queued(prev) ||
+					     prev->se.sched_delayed);
 
 		trace_sched_switch(preempt, prev, next, prev_state);
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..6ade91bce63e 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -138,6 +138,10 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
 	if (flags & ENQUEUE_RESTORE)
 		return;
 
+	/* psi_sched_switch() will handle the flags */
+	if (task_on_cpu(task_rq(p), p))
+		return;
+
 	if (p->se.sched_delayed) {
 		/* CPU migration of "sleeping" task */
 		SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));

base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb
-- 
2.34.1