From nobody Sun Feb 8 07:26:52 2026
From: K Prateek Nayak
To: Johannes Weiner, Suren Baghdasaryan, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot
CC: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Chengming Zhou, Muchun Song,
	"Gautham R. Shenoy", K Prateek Nayak
Subject: [PATCH] psi: Fix race when task wakes up before psi_sched_switch() adjusts flags
Date: Thu, 26 Dec 2024 05:34:41 +0000
Message-ID: <20241226053441.1110-1-kprateek.nayak@amd.com>
X-Mailer: git-send-email 2.43.0
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

When running hackbench in a cgroup with bandwidth throttling enabled,
the following PSI splat was observed:

    psi: inconsistent task state! task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4

When investigating the series of events leading up to the splat, the
following sequence was observed:

    [008] d..2.: sched_switch: ... ==> next_comm=hackbench next_pid=1831 next_prio=120
    ...
    [008] dN.2.: dequeue_entity(task delayed): task=hackbench pid=1831 cfs_rq->throttled=0
    [008] dN.2.: pick_task_fair: check_cfs_rq_runtime() throttled cfs_rq on CPU8

    # CPU8 goes into newidle balance and releases the rq lock
    ...
    # CPU15 on same LLC Domain is trying to wakeup hackbench(pid=1831)

    [015] d..4.: psi_flags_change: psi: task state: task=1831:hackbench cpu=8 psi_flags=14 clear=0 set=4 final=14 # Splat (cfs_rq->throttled=1)
    [015] d..4.: sched_wakeup: comm=hackbench pid=1831 prio=120 target_cpu=008 # Task has woken on a throttled hierarchy
    [008] d..2.: sched_switch: prev_comm=hackbench prev_pid=1831 prev_prio=120 prev_state=S ==> ...

psi_dequeue() relies on psi_sched_switch() to set the correct PSI flags
for the blocked entity; however, the following race is possible with
psi_enqueue() / psi_ttwu_dequeue() in the path from psi_dequeue() to
psi_sched_switch():

    __schedule()
      rq_lock(rq)
      try_to_block_task(p)
        psi_dequeue()
        [ psi_task_switch() is responsible
          for adjusting the PSI flags ]
      put_prev_entity(&p->se)              try_to_wake_up(p)
      # no runnable task on rq->cfs        ...
      sched_balance_newidle()
        raw_spin_rq_unlock(rq)             __task_rq_lock(p)
        ...                                psi_enqueue()/psi_ttwu_dequeue() [Woops!]
                                           __task_rq_unlock(p)
        raw_spin_rq_lock(rq)
        ...
      [ p was re-enqueued or has migrated away ]
      ...
      psi_task_switch() [Too late!]
      raw_spin_rq_unlock(rq)

The wakeup context will see the flags for a running task when the flags
should have reflected the task being blocked.
Similarly, a migration context in the wakeup path can clear the flags
that psi_sched_switch() assumes will be set (TSK_ONCPU / TSK_RUNNING).

Since the TSK_ONCPU flag has to be modified with the rq lock of
task_cpu() held, use a combination of task_cpu() and TSK_ONCPU checks
to prevent the race. Specifically:

o psi_enqueue() will clear the TSK_ONCPU flag when it encounters one.
  psi_enqueue() will only be called with TSK_ONCPU set when the task
  is being requeued on the same CPU. If the task was migrated,
  psi_ttwu_dequeue() would have already cleared the PSI flags.
  psi_enqueue() cannot guarantee that this same task will be picked
  again when the scheduling CPU returns from newidle balance, which is
  why it clears TSK_ONCPU to mimic the net result of a sleep + wakeup
  without migration.

o When psi_sched_switch() observes that prev's task_cpu() has changed
  or the TSK_ONCPU flag is not set, a wakeup has raced with
  psi_sched_switch() trying to adjust the dequeue flags. If next is
  the same as prev, psi_sched_switch() now has to set the TSK_ONCPU
  flag again. Otherwise, psi_enqueue() or psi_ttwu_dequeue() has
  already adjusted the PSI flags and no further changes are required
  to prev's PSI flags.

With the introduction of DELAY_DEQUEUE, the requeue path is
considerably shortened, and with the addition of bandwidth throttling
in the __schedule() path, the race window is large enough to observe
this issue.
Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: K Prateek Nayak
Reported-by: K Prateek Nayak
Tested-by: K Prateek Nayak
---
This patch is based on tip:sched/core at commit af98d8a36a96
("sched/fair: Fix CPU bandwidth limit bypass during CPU hotplug")

Reproducer for the PSI splat:

    mkdir /sys/fs/cgroup/test
    echo $$ > /sys/fs/cgroup/test/cgroup.procs
    # Ridiculous limit on SMP to throttle multiple rqs at once
    echo "50000 100000" > /sys/fs/cgroup/test/cpu.max
    perf bench sched messaging -t -p -l 100000 -g 16

This worked reliably on my 3rd Generation EPYC System (2 x 64C/128T)
but also on a 32 vCPU VM.
---
 kernel/sched/core.c  |  7 ++++-
 kernel/sched/psi.c   | 65 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/stats.h | 16 ++++++++++-
 3 files changed, 83 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 84902936a620..9bbe51e44e98 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6717,6 +6717,12 @@ static void __sched notrace __schedule(int sched_mode)
 		rq->last_seen_need_resched_ns = 0;
 #endif
 
+	/*
+	 * PSI might have to deal with the consequences of newidle balance
+	 * possibly dropping the rq lock and prev being requeued and selected.
+	 */
+	psi_sched_switch(prev, next, block);
+
 	if (likely(prev != next)) {
 		rq->nr_switches++;
 		/*
@@ -6750,7 +6756,6 @@ static void __sched notrace __schedule(int sched_mode)
 
 		migrate_disable_switch(rq, prev);
 		psi_account_irqtime(rq, prev, next);
-		psi_sched_switch(prev, next, block);
 
 		trace_sched_switch(preempt, prev, next, prev_state);
 
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 84dad1511d1e..c355a6189595 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -917,9 +917,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep)
 {
 	struct psi_group *group, *common = NULL;
-	int cpu = task_cpu(prev);
+	int prev_cpu, cpu;
+
+	/* No race between psi_dequeue() and now */
+	if (prev == next && (prev->psi_flags & TSK_ONCPU))
+		return;
+
+	prev_cpu = task_cpu(prev);
+	cpu = smp_processor_id();
 
 	if (next->pid) {
+		/*
+		 * If next == prev but TSK_ONCPU is cleared, the task was
+		 * requeued when newidle balance dropped the rq lock and
+		 * psi_enqueue() cleared the TSK_ONCPU flag.
+		 */
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * Set TSK_ONCPU on @next's cgroups. If @next shares any
@@ -928,8 +940,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 */
 		group = task_psi_group(next);
 		do {
-			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
-			    PSI_ONCPU) {
+			/*
+			 * Since newidle balance can drop the rq lock (see the next comment)
+			 * there is a possibility of try_to_wake_up() migrating prev away
+			 * before reaching here. Do not find common if task has migrated.
+			 */
+			if (prev_cpu == cpu &&
+			    (per_cpu_ptr(group->pcpu, cpu)->state_mask & PSI_ONCPU)) {
 				common = group;
 				break;
 			}
@@ -938,6 +955,48 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		} while ((group = group->parent));
 	}
 
+	/*
+	 * When a task is blocked, psi_dequeue() leaves the PSI flag
+	 * adjustments to psi_task_switch() however, there is a possibility of
+	 * rq lock being dropped in the interim and the task being woken up
+	 * again before psi_task_switch() is called leading to psi_enqueue()
+	 * seeing the flags for a running task. Specifically, the following
+	 * scenario is possible:
+	 *
+	 *   __schedule()
+	 *     rq_lock(rq)
+	 *     try_to_block_task(p)
+	 *       psi_dequeue()
+	 *       [ psi_task_switch() is responsible
+	 *         for adjusting the PSI flags ]
+	 *     put_prev_entity(&p->se)            try_to_wake_up(p)
+	 *     # no runnable task on rq->cfs      ...
+	 *     sched_balance_newidle()
+	 *       raw_spin_rq_unlock(rq)           __task_rq_lock(p)
+	 *       ...                              psi_enqueue()/psi_ttwu_dequeue() [Woops!]
+	 *                                        __task_rq_unlock(p)
+	 *       raw_spin_rq_lock(rq)
+	 *       ...
+	 *     [ p was re-enqueued or has migrated away ]
+	 *     ...
+	 *     psi_task_switch() [Too late!]
+	 *     raw_spin_rq_unlock(rq)
+	 *
+	 * In the above case, psi_enqueue() can see the p->psi_flags state
+	 * before it is adjusted to account for dequeue in psi_task_switch(),
+	 * or psi_ttwu_dequeue() can clear the p->psi_flags which
+	 * psi_task_switch() tries to adjust assuming that the entity has just
+	 * finished running.
+	 *
+	 * Since TSK_ONCPU has to be adjusted holding task CPU's rq lock, use
+	 * the combination of TSK_ONCPU and task_cpu(p) to catch the race
+	 * between psi_task_switch() and psi_enqueue() / psi_ttwu_dequeue().
+	 * Since psi_enqueue() / psi_ttwu_dequeue() would have set the correct
+	 * flags already for prev on this CPU, skip adjusting flags.
+	 */
+	if (prev == next || prev_cpu != cpu || !(prev->psi_flags & TSK_ONCPU))
+		return;
+
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
 		bool wake_clock = true;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 8ee0add5a48a..f09903165456 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -138,7 +138,21 @@ static inline void psi_enqueue(struct task_struct *p, int flags)
 	if (flags & ENQUEUE_RESTORE)
 		return;
 
-	if (p->se.sched_delayed) {
+	if (p->psi_flags & TSK_ONCPU) {
+		/*
+		 * psi_enqueue() can race with psi_task_switch() where
+		 * TSK_ONCPU will be still set for the task (see the
+		 * comment in psi_task_switch())
+		 *
+		 * Reaching here with TSK_ONCPU is only possible when
+		 * the task is being enqueued on the same CPU. Since
+		 * psi_task_switch() has not had the chance to adjust
+		 * the flags yet, just clear the TSK_ONCPU which yields
+		 * the same result as sleep + wakeup without migration.
+		 */
+		SCHED_WARN_ON(flags & ENQUEUE_MIGRATED);
+		clear = TSK_ONCPU;
+	} else if (p->se.sched_delayed) {
 		/* CPU migration of "sleeping" task */
 		SCHED_WARN_ON(!(flags & ENQUEUE_MIGRATED));
 		if (p->in_memstall)

base-commit: af98d8a36a963e758e84266d152b92c7b51d4ecb
-- 
2.34.1