From nobody Sun Feb 8 07:26:52 2026
From: K Prateek Nayak
To: Johannes Weiner, Suren Baghdasaryan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
CC: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Chengming Zhou, Muchun Song,
	"Gautham R. Shenoy", K Prateek Nayak
Subject: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Date: Thu, 26 Dec 2024 05:34:41 +0000
Message-ID: <20241226053441.1110-1-kprateek.nayak@amd.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:

    psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat, the
following sequence was observed:

    [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
    ...
    [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
    [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8

    # CPU8 goes into newidle balance and releases the rq lock
    ...
    # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)

    [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
    [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
    [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity; however, the following race is possible with
psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
psi_sched_switch():

    __schedule()
      rq_lock(rq)
      try_to_block_task(p)
        psi_dequeue()
        [ psi_task_switch() is responsible
          for adjusting the PSI flags ]
      put_prev_entity(&p->se)              try_to_wake_up(p)
      # no runnable task on rq->cfs        ...
      sched_balance_newidle()
        raw_spin_rq_unlock(rq)             __task_rq_lock(p)
        ...                                psi_enqueue()/psi_ttwu_dequeue() [Woops!]
                                           __task_rq_unlock(p)
        raw_spin_rq_lock(rq)
        ...
      [ p was re-enqueued or has migrated away ]
      ...
      psi_task_switch() [Too late!]
      raw_spin_rq_unlock(rq)

The wakeup context will see the flags for a running task when the flags
should have reflected the task being blocked.
Similarly, a migration context in the wakeup path can clear the flags
that psi_sched_switch() assumes will be set (TSK_ONCPU / TSK_RUNNING).

Since the TSK_ONCPU flag has to be modified with the rq lock of
task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks
to prevent the race. Specifically:

o psi_enqueue() will clear the TSK_ONCPU flag when it encounters one.
  psi_enqueue() will only be called with TSK_ONCPU set when the task
  is being requeued on the same CPU. If the task was migrated,
  psi_ttwu_dequeue() would have already cleared the PSI flags.
  psi_enqueue() cannot guarantee that this same task will be picked
  again when the scheduling CPU returns from newidle balance, which is
  why it clears TSK_ONCPU to mimic the net result of a sleep + wakeup
  without migration.

o When psi_sched_switch() observes that prev's task_cpu() has changed
  or the TSK_ONCPU flag is not set, a wakeup has raced with
  psi_sched_switch() trying to adjust the dequeue flags. If next is
  the same as prev, psi_sched_switch() now has to set the TSK_ONCPU
  flag again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() has
  already adjusted the PSI flags and no further changes are required
  to prev's PSI flags.

With the introduction of DELAY_DEQUEUE, the requeue path is
considerably shortened, and with the addition of bandwidth throttling
in the __schedule() path, the race window is large enough to observe
this issue.
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: K Prateek Nayak
Reported-by: K Prateek Nayak
Tested-by: K Prateek Nayak
---
This patch is based on tip:sched/core at commit af98d8a36a96
("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")

Reproducer for the PSI splat:

    mkdir /sys/fs/cgroup/test
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    # Ridiculous limit on SMP to throttle multiple rqs at once
    echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
    perf bench sched messaging -t -p -l 100000 -g 16

This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T)
but also on a 32 vCPU VM.
---
 kernel/sched/core.c  |  7 ++++-
 kernel/sched/psi.c   | 65 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/stats.h | 16 ++++++++++-
 3 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..9bbe51e44e98 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
 		rq->last_seen_need_resched_ns = 0;
 #endif
 
+	/*
+	 * PSI might have to deal with the consequences of newidle balance
+	 * possibly dropping the rq lock and prev being requeued and selected.
+	 */
+	psi_sched_switch(prev, next, block);
+
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		/*
@@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
 
 		migrate_disable_switch(rq, prev);
 		psi_account_irqtime(rq, prev, next);
-		psi_sched_switch(prev, next, block);
 
 		trace_sched_switch(preempt, prev, next, prev_state);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..c355a6189595 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
 	struct psi_group *group, *common = NULL;
-	int cpu = task_cpu(prev);
+	int prev_cpu, cpu;
+
+	/* No race between psi_dequeue() and now */
+	if (prev == next && (prev->psi_flags & TSK_ONCPU))
+		return;
+
+	prev_cpu = task_cpu(prev);
+	cpu = smp_processor_id();
 
 	if (next->pid) {
+		/*
+		 * If next == prev but TSK_ONCPU is cleared, the task was
+		 * requeued when newidle balance dropped the rq lock and
+		 * psi_enqueue() cleared the TSK_ONCPU flag.
+		 */
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * Set TSK_ONCPU on @next's cgroups. If @next shares any
@@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		group = task_psi_group(next);
 		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+			/*
+			 * Since newidle balance can drop the rq lock (see the next comment)
+			 * there is a possibility of try_to_wake_up() migrating prev away
+			 * before reaching here. Do not find common if task has migrated.
+			 */
+			if (prev_cpu == cpu &&
+			    (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
 				common = group;
 				break;
 			}
@@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		} while ((group = group->parent));
 	}
 
+	/*
+	 * When a task is blocked, psi_dequeue() leaves the PSI flag
+	 * adjustments to psi_task_switch() however, there is a possibility of
+	 * rq lock being dropped in the interim and the task being woken up
+	 * again before psi_task_switch() is called leading to psi_enqueue()
+	 * seeing the flags for a running task. Specifically, the following
+	 * scenario is possible:
+	 *
+	 *   __schedule()
+	 *     rq_lock(rq)
+	 *     try_to_block_task(p)
+	 *       psi_dequeue()
+	 *       [ psi_task_switch() is responsible
+	 *         for adjusting the PSI flags ]
+	 *     put_prev_entity(&p->se)            try_to_wake_up(p)
+	 *     # no runnable task on rq->cfs      ...
+	 *     sched_balance_newidle()
+	 *       raw_spin_rq_unlock(rq)           __task_rq_lock(p)
+	 *       ...                              psi_enqueue()/psi_ttwu_dequeue() [Woops!]
+	 *                                        __task_rq_unlock(p)
+	 *       raw_spin_rq_lock(rq)
+	 *       ...
+	 *     [ p was re-enqueued or has migrated away ]
+	 *     ...
+	 *     psi_task_switch() [Too late!]
+	 *     raw_spin_rq_unlock(rq)
+	 *
+	 * In the above case, psi_enqueue() can see the p->psi_flags state
+	 * before it is adjusted to account for dequeue in psi_task_switch(),
+	 * or psi_ttwu_dequeue() can clear the p->psi_flags which
+	 * psi_task_switch() tries to adjust assuming that the entity has just
+	 * finished running.
+	 *
+	 * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
+	 * the combination of TSK_ONCPU and task_cpu(p) to catch the race
+	 * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue().
+	 * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
+	 * flags already for prev on this CPU, skip adjusting flags.
+	 */
+	if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
+		return;
+
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
 		bool wake_clock = true;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..f09903165456 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
 	if (flags & ENQUEUE_RESTORE)
 		return;
 
-	if (p->se.sched_delayed) {
+	if (p->psi_flags & TSK_ONCPU) {
+		/*
+		 * psi_enqueue() can race with psi_task_switch() where
+		 * TSK_ONCPU will be still set for the task (see the
+		 * comment in psi_task_switch())
+		 *
+		 * Reaching here with TSK_ONCPU is only possible when
+		 * the task is being enqueued on the same CPU. Since
+		 * psi_task_switch() has not had the chance to adjust
+		 * the flags yet, just clear the TSK_ONCPU which yields
+		 * the same result as sleep + wakeup without migration.
+		 */
+		SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
+		clear = TSK_ONCPU;
+	} else if (p->se.sched_delayed) {
 		/* CPU migration of "sleeping" task */
 		SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
 		if (p->in_memstall)

base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb
-- 
2.34.1