Date: Thu, 1 Jun 2023 05:58:04 +0000
In-Reply-To: <20230601055846.2349566-1-jstultz@google.com>
Message-ID: <20230601055846.2349566-2-jstultz@google.com>
Subject: [PATCH v4 01/13] sched: Unify runtime accounting across classes
From: John Stultz
To: LKML

From: Peter Zijlstra

All classes use sched_entity::exec_start to track runtime and have
copies of the exact same code around to compute runtime. Collapse
all that.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Youssef Esmat
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E . McKenney"
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
[fix conflicts, fold in update_current_exec_runtime]
Signed-off-by: Connor O'Brien
[jstultz: rebased, resolving minor conflicts]
Signed-off-by: John Stultz
---
NOTE: This patch is a general cleanup and, if no one objects, could be
merged at this point. If needed, I'll resend it separately if it isn't
picked up on its own.
---
 include/linux/sched.h    |  2 +-
 kernel/sched/deadline.c  | 13 +++-------
 kernel/sched/fair.c      | 56 ++++++++++++++++++++++++++++++----------
 kernel/sched/rt.c        | 13 +++-------
 kernel/sched/sched.h     | 12 ++-------
 kernel/sched/stop_task.c | 13 +--------
 6 files changed, 52 insertions(+), 57 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..37dd571a1246 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -521,7 +521,7 @@ struct sched_statistics {
     u64    block_max;
     s64    sum_block_runtime;
 
-    u64    exec_max;
+    s64    exec_max;
     u64    slice_max;
 
     u64    nr_migrations_cold;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 5a9a4b81c972..f6f746d52410 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1308,9 +1308,8 @@ static void update_curr_dl(struct rq *rq)
 {
     struct task_struct *curr = rq->curr;
     struct sched_dl_entity *dl_se = &curr->dl;
-    u64 delta_exec, scaled_delta_exec;
+    s64 delta_exec, scaled_delta_exec;
     int cpu = cpu_of(rq);
-    u64 now;
 
     if (!dl_task(curr) || !on_dl_rq(dl_se))
         return;
@@ -1323,21 +1322,15 @@ static void update_curr_dl(struct rq *rq)
      * natural solution, but the full ramifications of this
      * approach need further study.
      */
-    now = rq_clock_task(rq);
-    delta_exec = now - curr->se.exec_start;
-    if (unlikely((s64)delta_exec <= 0)) {
+    delta_exec = update_curr_common(rq);
+    if (unlikely(delta_exec <= 0)) {
         if (unlikely(dl_se->dl_yielded))
             goto throttle;
         return;
     }
 
-    schedstat_set(curr->stats.exec_max,
-              max(curr->stats.exec_max, delta_exec));
-    trace_sched_stat_runtime(curr, delta_exec, 0);
 
-    update_current_exec_runtime(curr, now, delta_exec);
-
     if (dl_entity_is_special(dl_se))
         return;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..bf9e8f29398e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -891,23 +891,17 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq)
 }
 #endif /* CONFIG_SMP */
 
-/*
- * Update the current task's runtime statistics.
- */
-static void update_curr(struct cfs_rq *cfs_rq)
+static s64 update_curr_se(struct rq *rq, struct sched_entity *curr)
 {
-    struct sched_entity *curr = cfs_rq->curr;
-    u64 now = rq_clock_task(rq_of(cfs_rq));
-    u64 delta_exec;
-
-    if (unlikely(!curr))
-        return;
+    u64 now = rq_clock_task(rq);
+    s64 delta_exec;
 
     delta_exec = now - curr->exec_start;
-    if (unlikely((s64)delta_exec <= 0))
-        return;
+    if (unlikely(delta_exec <= 0))
+        return delta_exec;
 
     curr->exec_start = now;
+    curr->sum_exec_runtime += delta_exec;
 
     if (schedstat_enabled()) {
         struct sched_statistics *stats;
@@ -917,9 +911,43 @@ static void update_curr(struct cfs_rq *cfs_rq)
                 max(delta_exec, stats->exec_max));
     }
 
-    curr->sum_exec_runtime += delta_exec;
-    schedstat_add(cfs_rq->exec_clock, delta_exec);
+    return delta_exec;
+}
+
+/*
+ * Used by other classes to account runtime.
+ */
+s64 update_curr_common(struct rq *rq)
+{
+    struct task_struct *curr = rq->curr;
+    s64 delta_exec;
 
+    delta_exec = update_curr_se(rq, &curr->se);
+    if (unlikely(delta_exec <= 0))
+        return delta_exec;
+
+    account_group_exec_runtime(curr, delta_exec);
+    cgroup_account_cputime(curr, delta_exec);
+
+    return delta_exec;
+}
+
+/*
+ * Update the current task's runtime statistics.
+ */
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+    struct sched_entity *curr = cfs_rq->curr;
+    s64 delta_exec;
+
+    if (unlikely(!curr))
+        return;
+
+    delta_exec = update_curr_se(rq_of(cfs_rq), curr);
+    if (unlikely(delta_exec <= 0))
+        return;
+
+    schedstat_add(cfs_rq->exec_clock, delta_exec);
     curr->vruntime += calc_delta_fair(delta_exec, curr);
     update_min_vruntime(cfs_rq);
 
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 00e0e5074115..0d0b276c447d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1046,24 +1046,17 @@ static void update_curr_rt(struct rq *rq)
 {
     struct task_struct *curr = rq->curr;
     struct sched_rt_entity *rt_se = &curr->rt;
-    u64 delta_exec;
-    u64 now;
+    s64 delta_exec;
 
     if (curr->sched_class != &rt_sched_class)
         return;
 
-    now = rq_clock_task(rq);
-    delta_exec = now - curr->se.exec_start;
-    if (unlikely((s64)delta_exec <= 0))
+    delta_exec = update_curr_common(rq);
+    if (unlikely(delta_exec < 0))
         return;
 
-    schedstat_set(curr->stats.exec_max,
-              max(curr->stats.exec_max, delta_exec));
-    trace_sched_stat_runtime(curr, delta_exec, 0);
 
-    update_current_exec_runtime(curr, now, delta_exec);
-
     if (!rt_bandwidth_enabled())
         return;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec7b3e0a2b20..4a1ef64449b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2166,6 +2166,8 @@ struct affinity_context {
     unsigned int flags;
 };
 
+extern s64 update_curr_common(struct rq *rq);
+
 struct sched_class {
 
 #ifdef CONFIG_UCLAMP_TASK
@@ -3242,16 +3244,6 @@ extern int sched_dynamic_mode(const char *str);
 extern void sched_dynamic_update(int mode);
 #endif
 
-static inline void update_current_exec_runtime(struct task_struct *curr,
-                        u64 now, u64 delta_exec)
-{
-    curr->se.sum_exec_runtime += delta_exec;
-    account_group_exec_runtime(curr, delta_exec);
-
-    curr->se.exec_start = now;
-    cgroup_account_cputime(curr, delta_exec);
-}
-
 #ifdef CONFIG_SCHED_MM_CID
 
 #define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 85590599b4d6..7595494ceb6d 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -70,18 +70,7 @@ static void yield_task_stop(struct rq *rq)
 
 static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
 {
-    struct task_struct *curr = rq->curr;
-    u64 now, delta_exec;
-
-    now = rq_clock_task(rq);
-    delta_exec = now - curr->se.exec_start;
-    if (unlikely((s64)delta_exec < 0))
-        delta_exec = 0;
-
-    schedstat_set(curr->stats.exec_max,
-              max(curr->stats.exec_max, delta_exec));
-
-    update_current_exec_runtime(curr, now, delta_exec);
+    update_curr_common(rq);
 }
 
 /*
-- 
2.41.0.rc0.172.g3f132b7071-goog
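
[A usage illustration of the helper this patch introduces. The class name
below is hypothetical; only update_curr_common() and its return convention
come from the patch above.]

  /*
   * Sketch only: what a scheduling class's runtime-accounting path
   * looks like after this change.
   */
  static void update_curr_foo(struct rq *rq)
  {
      s64 delta_exec = update_curr_common(rq);

      if (delta_exec <= 0)
          return;

      /* class-specific bandwidth/throttling logic consumes delta_exec here */
  }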

Date: Thu, 1 Jun 2023 05:58:05 +0000
In-Reply-To: <20230601055846.2349566-1-jstultz@google.com>
Message-ID: <20230601055846.2349566-3-jstultz@google.com>
Subject: [PATCH v4 02/13] locking/ww_mutex: Remove wakeups from under mutex::wait_lock
From: John Stultz
To: LKML

From: Peter Zijlstra

In preparation to nest mutex::wait_lock under rq::lock we need to
remove wakeups from under it.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Youssef Esmat
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E . McKenney"
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Connor O'Brien
Signed-off-by: John Stultz
---
v2:
* Move wake_q_init() as suggested by Waiman Long
---
 include/linux/ww_mutex.h  |  3 +++
 kernel/locking/mutex.c    |  8 ++++++++
 kernel/locking/ww_mutex.h | 10 ++++++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
index bb763085479a..9335b2202017 100644
--- a/include/linux/ww_mutex.h
+++ b/include/linux/ww_mutex.h
@@ -19,6 +19,7 @@
 
 #include <linux/mutex.h>
 #include <linux/rtmutex.h>
+#include <linux/sched/wake_q.h>
 
 #if defined(CONFIG_DEBUG_MUTEXES) || \
     (defined(CONFIG_PREEMPT_RT) && defined(CONFIG_DEBUG_RT_MUTEXES))
@@ -58,6 +59,7 @@ struct ww_acquire_ctx {
     unsigned int acquired;
     unsigned short wounded;
     unsigned short is_wait_die;
+    struct wake_q_head wake_q;
 #ifdef DEBUG_WW_MUTEXES
     unsigned int done_acquire;
     struct ww_class *ww_class;
@@ -137,6 +139,7 @@ static inline void ww_acquire_init(struct ww_acquire_ctx *ctx,
     ctx->acquired = 0;
     ctx->wounded = false;
     ctx->is_wait_die = ww_class->is_wait_die;
+    wake_q_init(&ctx->wake_q);
 #ifdef DEBUG_WW_MUTEXES
     ctx->ww_class = ww_class;
     ctx->done_acquire = 0;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d973fe6041bf..1582756914df 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -676,6 +676,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     }
 
     raw_spin_unlock(&lock->wait_lock);
+    if (ww_ctx)
+        ww_ctx_wake(ww_ctx);
     schedule_preempt_disabled();
 
     first = __mutex_waiter_is_first(lock, &waiter);
@@ -725,6 +727,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
         ww_mutex_lock_acquired(ww, ww_ctx);
 
     raw_spin_unlock(&lock->wait_lock);
+    if (ww_ctx)
+        ww_ctx_wake(ww_ctx);
     preempt_enable();
     return 0;
 
@@ -736,6 +740,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     raw_spin_unlock(&lock->wait_lock);
     debug_mutex_free_waiter(&waiter);
     mutex_release(&lock->dep_map, ip);
+    if (ww_ctx)
+        ww_ctx_wake(ww_ctx);
     preempt_enable();
     return ret;
 }
@@ -946,9 +952,11 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned
     if (owner & MUTEX_FLAG_HANDOFF)
         __mutex_handoff(lock, next);
 
+    preempt_disable();
     raw_spin_unlock(&lock->wait_lock);
 
     wake_up_q(&wake_q);
+    preempt_enable();
 }
 
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 56f139201f24..e49ea5336473 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -161,6 +161,12 @@ static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
 
 #endif /* WW_RT */
 
+void ww_ctx_wake(struct ww_acquire_ctx *ww_ctx)
+{
+    wake_up_q(&ww_ctx->wake_q);
+    wake_q_init(&ww_ctx->wake_q);
+}
+
 /*
  * Wait-Die:
  *   The newer transactions are killed when:
@@ -284,7 +290,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
 #ifndef WW_RT
         debug_mutex_wake_waiter(lock, waiter);
 #endif
-        wake_up_process(waiter->task);
+        wake_q_add(&ww_ctx->wake_q, waiter->task);
     }
 
     return true;
@@ -331,7 +337,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
      * wakeup pending to re-read the wounded state.
      */
     if (owner != current)
-        wake_up_process(owner);
+        wake_q_add(&ww_ctx->wake_q, owner);
 
     return true;
 }
-- 
2.41.0.rc0.172.g3f132b7071-goog
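
[The change follows the usual deferred-wakeup pattern: waiters are queued
on a wake_q while wait_lock is held and only woken after it is dropped.
A condensed sketch of that pattern, simplified from the hunks above:]

      raw_spin_lock(&lock->wait_lock);
      /* ... decide who needs waking ... */
      wake_q_add(&ww_ctx->wake_q, waiter->task);   /* queue, no wakeup yet */
      raw_spin_unlock(&lock->wait_lock);

      wake_up_q(&ww_ctx->wake_q);    /* actual wakeups, lock no longer held */
      wake_q_init(&ww_ctx->wake_q);  /* re-init so the context can be reused */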

Date: Thu, 1 Jun 2023 05:58:06 +0000
In-Reply-To: <20230601055846.2349566-1-jstultz@google.com>
Message-ID: <20230601055846.2349566-4-jstultz@google.com>
Subject: [PATCH v4 03/13] locking/mutex: make mutex::wait_lock irq safe
From: John Stultz
To: LKML

From: Juri Lelli

mutex::wait_lock might be nested under rq->lock. Make it irq safe then.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Youssef Esmat
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E . McKenney"
Cc: kernel-team@android.com
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
[rebase & fix {un,}lock_wait_lock helpers in ww_mutex.h]
Signed-off-by: Connor O'Brien
Signed-off-by: John Stultz
---
v3:
* Re-added this patch after it was dropped in v2, which caused lockdep
  warnings to trip.
---
 kernel/locking/mutex.c    | 18 ++++++++++--------
 kernel/locking/ww_mutex.h | 22 ++++++++++++----------
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 1582756914df..a528e7f42caa 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -572,6 +572,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
 {
     struct mutex_waiter waiter;
     struct ww_mutex *ww;
+    unsigned long flags;
     int ret;
 
     if (!use_ww_ctx)
@@ -614,7 +615,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
         return 0;
     }
 
-    raw_spin_lock(&lock->wait_lock);
+    raw_spin_lock_irqsave(&lock->wait_lock, flags);
     /*
      * After waiting to acquire the wait_lock, try again.
      */
@@ -675,7 +676,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
         goto err;
     }
 
-    raw_spin_unlock(&lock->wait_lock);
+    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
     if (ww_ctx)
         ww_ctx_wake(ww_ctx);
     schedule_preempt_disabled();
@@ -698,9 +699,9 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
             trace_contention_begin(lock, LCB_F_MUTEX);
         }
 
-        raw_spin_lock(&lock->wait_lock);
+        raw_spin_lock_irqsave(&lock->wait_lock, flags);
     }
-    raw_spin_lock(&lock->wait_lock);
+    raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
     __set_current_state(TASK_RUNNING);
 
@@ -726,7 +727,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     if (ww_ctx)
         ww_mutex_lock_acquired(ww, ww_ctx);
 
-    raw_spin_unlock(&lock->wait_lock);
+    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
     if (ww_ctx)
         ww_ctx_wake(ww_ctx);
     preempt_enable();
@@ -737,7 +738,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     __mutex_remove_waiter(lock, &waiter);
 err_early_kill:
     trace_contention_end(lock, ret);
-    raw_spin_unlock(&lock->wait_lock);
+    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
     debug_mutex_free_waiter(&waiter);
     mutex_release(&lock->dep_map, ip);
     if (ww_ctx)
@@ -909,6 +910,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned
     struct task_struct *next = NULL;
     DEFINE_WAKE_Q(wake_q);
     unsigned long owner;
+    unsigned long flags;
 
     mutex_release(&lock->dep_map, ip);
 
@@ -935,7 +937,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned
         }
     }
 
-    raw_spin_lock(&lock->wait_lock);
+    raw_spin_lock_irqsave(&lock->wait_lock, flags);
     debug_mutex_unlock(lock);
     if (!list_empty(&lock->wait_list)) {
         /* get the first entry from the wait-list: */
@@ -953,7 +955,7 @@ static noinline void __sched __mutex_unlock_slowpath(struct mutex *lock, unsigned
         __mutex_handoff(lock, next);
 
     preempt_disable();
-    raw_spin_unlock(&lock->wait_lock);
+    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 
     wake_up_q(&wake_q);
     preempt_enable();
 }
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index e49ea5336473..984a4e0bff36 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -70,14 +70,14 @@ __ww_mutex_has_waiters(struct mutex *lock)
     return atomic_long_read(&lock->owner) & MUTEX_FLAG_WAITERS;
 }
 
-static inline void lock_wait_lock(struct mutex *lock)
+static inline void lock_wait_lock(struct mutex *lock, unsigned long *flags)
 {
-    raw_spin_lock(&lock->wait_lock);
+    raw_spin_lock_irqsave(&lock->wait_lock, *flags);
 }
 
-static inline void unlock_wait_lock(struct mutex *lock)
+static inline void unlock_wait_lock(struct mutex *lock, unsigned long flags)
 {
-    raw_spin_unlock(&lock->wait_lock);
+    raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
 }
 
 static inline void lockdep_assert_wait_lock_held(struct mutex *lock)
@@ -144,14 +144,14 @@ __ww_mutex_has_waiters(struct rt_mutex *lock)
     return rt_mutex_has_waiters(&lock->rtmutex);
 }
 
-static inline void lock_wait_lock(struct rt_mutex *lock)
+static inline void lock_wait_lock(struct rt_mutex *lock, unsigned long *flags)
 {
-    raw_spin_lock(&lock->rtmutex.wait_lock);
+    raw_spin_lock_irqsave(&lock->rtmutex.wait_lock, *flags);
 }
 
-static inline void unlock_wait_lock(struct rt_mutex *lock)
+static inline void unlock_wait_lock(struct rt_mutex *lock, unsigned long flags)
 {
-    raw_spin_unlock(&lock->rtmutex.wait_lock);
+    raw_spin_unlock_irqrestore(&lock->rtmutex.wait_lock, flags);
 }
 
 static inline void lockdep_assert_wait_lock_held(struct rt_mutex *lock)
@@ -383,6 +383,8 @@ __ww_mutex_check_waiters(struct MUTEX *lock, struct ww_acquire_ctx *ww_ctx)
 static __always_inline void
 ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
 {
+    unsigned long flags;
+
     ww_mutex_lock_acquired(lock, ctx);
 
     /*
@@ -410,9 +412,9 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
      * Uh oh, we raced in fastpath, check if any of the waiters need to
      * die or wound us.
      */
-    lock_wait_lock(&lock->base);
+    lock_wait_lock(&lock->base, &flags);
     __ww_mutex_check_waiters(&lock->base, ctx);
-    unlock_wait_lock(&lock->base);
+    unlock_wait_lock(&lock->base, flags);
 }
 
 static __always_inline int
-- 
2.41.0.rc0.172.g3f132b7071-goog
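
[The conversion is the standard irq-safe spinlock pattern; every wait_lock
critical section ends up with this shape after the patch:]

      unsigned long flags;

      raw_spin_lock_irqsave(&lock->wait_lock, flags);
      /* wait_lock-protected work; now safe even when nested under rq->lock */
      raw_spin_unlock_irqrestore(&lock->wait_lock, flags);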

Date: Thu, 1 Jun 2023 05:58:07 +0000
In-Reply-To: <20230601055846.2349566-1-jstultz@google.com>
Message-ID: <20230601055846.2349566-5-jstultz@google.com>
Subject: [PATCH v4 04/13] locking/mutex: Rework task_struct::blocked_on
From: John Stultz
To: LKML

From: Peter Zijlstra

Track the blocked-on relation for mutexes; this allows following this
relation at schedule time.

    task
      | blocked-on
      v
    mutex
      | owner
      v
    task

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Youssef Esmat
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E . McKenney"
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
[minor changes while rebasing]
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Connor O'Brien
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths]
Signed-off-by: John Stultz
---
v2:
* Fixed blocked_on tracking in error paths that was causing crashes
v4:
* Ensure we clear blocked_on when waking ww_mutexes to die or wound.
  This is critical so we don't get circular blocked_on relationships
  that can't be resolved.
---
 include/linux/sched.h        |  5 +----
 kernel/fork.c                |  3 +--
 kernel/locking/mutex-debug.c |  9 +++++----
 kernel/locking/mutex.c       |  7 +++++++
 kernel/locking/ww_mutex.h    | 16 ++++++++++++++--
 5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 37dd571a1246..a312a2ff47bf 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1141,10 +1141,7 @@ struct task_struct {
     struct rt_mutex_waiter *pi_blocked_on;
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
-    /* Mutex deadlock detection: */
-    struct mutex_waiter *blocked_on;
-#endif
+    struct mutex *blocked_on;    /* lock we're blocked on */
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
     int non_block_count;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9244c540bb13 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2461,9 +2461,8 @@ __latent_entropy struct task_struct *copy_process(
     lockdep_init_task(p);
 #endif
 
-#ifdef CONFIG_DEBUG_MUTEXES
     p->blocked_on = NULL; /* not blocked yet */
-#endif
+
 #ifdef CONFIG_BCACHE
     p->sequential_io = 0;
     p->sequential_io_avg = 0;
diff --git a/kernel/locking/mutex-debug.c b/kernel/locking/mutex-debug.c
index bc8abb8549d2..7228909c3e62 100644
--- a/kernel/locking/mutex-debug.c
+++ b/kernel/locking/mutex-debug.c
@@ -52,17 +52,18 @@ void debug_mutex_add_waiter(struct mutex *lock, struct mutex_waiter *waiter,
 {
     lockdep_assert_held(&lock->wait_lock);
 
-    /* Mark the current thread as blocked on the lock: */
-    task->blocked_on = waiter;
+    /* Current thread can't be already blocked (since it's executing!) */
+    DEBUG_LOCKS_WARN_ON(task->blocked_on);
 }
 
 void debug_mutex_remove_waiter(struct mutex *lock, struct mutex_waiter *waiter,
              struct task_struct *task)
 {
+    struct mutex *blocked_on = READ_ONCE(task->blocked_on);
+
     DEBUG_LOCKS_WARN_ON(list_empty(&waiter->list));
     DEBUG_LOCKS_WARN_ON(waiter->task != task);
-    DEBUG_LOCKS_WARN_ON(task->blocked_on != waiter);
-    task->blocked_on = NULL;
+    DEBUG_LOCKS_WARN_ON(blocked_on && blocked_on != lock);
 
     INIT_LIST_HEAD(&waiter->list);
     waiter->task = NULL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a528e7f42caa..d7a202c35ebe 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -646,6 +646,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
         goto err_early_kill;
     }
 
+    current->blocked_on = lock;
     set_current_state(state);
     trace_contention_begin(lock, LCB_F_MUTEX);
     for (;;) {
@@ -683,6 +684,10 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
 
         first = __mutex_waiter_is_first(lock, &waiter);
 
+        /*
+         * Gets reset by ttwu_runnable().
+         */
+        current->blocked_on = lock;
         set_current_state(state);
         /*
          * Here we order against unlock; we must either see it change
@@ -720,6 +725,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     debug_mutex_free_waiter(&waiter);
 
 skip_wait:
+    current->blocked_on = NULL;
     /* got the lock - cleanup and rejoice! */
     lock_acquired(&lock->dep_map, ip);
     trace_contention_end(lock, 0);
@@ -734,6 +740,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     return 0;
 
 err:
+    current->blocked_on = NULL;
     __set_current_state(TASK_RUNNING);
     __mutex_remove_waiter(lock, &waiter);
 err_early_kill:
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 984a4e0bff36..7d623417b496 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -291,6 +291,12 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
         debug_mutex_wake_waiter(lock, waiter);
 #endif
         wake_q_add(&ww_ctx->wake_q, waiter->task);
+        /*
+         * When waking up the task to die, be sure to clear the
+         * blocked_on pointer. Otherwise we can see circular
+         * blocked_on relationships that can't resolve.
+         */
+        waiter->task->blocked_on = NULL;
     }
 
     return true;
@@ -336,9 +342,15 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
      * it's wounded in __ww_mutex_check_kill() or has a
      * wakeup pending to re-read the wounded state.
      */
-    if (owner != current)
+    if (owner != current) {
         wake_q_add(&ww_ctx->wake_q, owner);
-
+        /*
+         * When waking up the task to wound, be sure to clear the
+         * blocked_on pointer. Otherwise we can see circular
+         * blocked_on relationships that can't resolve.
+         */
+        owner->blocked_on = NULL;
+    }
     return true;
 }
-- 
2.41.0.rc0.172.g3f132b7071-goog
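
[For illustration, following the relation described in the changelog
(task -> blocked-on mutex -> owner -> ...) could look like the sketch
below. This is not code from the series: locking is elided, and
__mutex_owner() only becomes visible outside mutex code in patch 06/13.]

  static struct task_struct *chase_blocked_on(struct task_struct *p)
  {
      /* follow task -> mutex -> owner until we reach a task that can run */
      while (p->blocked_on) {
          struct task_struct *owner = __mutex_owner(p->blocked_on);

          if (!owner)
              break;    /* the lock was released in the meantime */
          p = owner;
      }
      return p;
  }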

Date: Thu, 1 Jun 2023 05:58:08 +0000
In-Reply-To: <20230601055846.2349566-1-jstultz@google.com>
Message-ID: <20230601055846.2349566-6-jstultz@google.com>
Subject: [PATCH v4 05/13] locking/mutex: Add task_struct::blocked_lock to serialize changes to the blocked_on state
From: John Stultz
To: LKML

From: Peter Zijlstra

This patch was split out from the later "sched: Add proxy execution"
patch.

Adds blocked_lock to the task_struct so we can safely keep track of
which tasks are blocked on us.

This will be used for tracking blocked-task/mutex chains with the
proxy-execution patch in a similar fashion to how priority inheritance
is done with rt_mutexes.

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Youssef Esmat
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E . McKenney"
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel)
[rebased, added comments and changelog]
Signed-off-by: Juri Lelli
[Fixed rebase conflicts]
[squashed sched: Ensure blocked_on is always guarded by blocked_lock]
Signed-off-by: Valentin Schneider
[fix rebase conflicts, various fixes & tweaks commented inline]
[squashed sched: Use rq->curr vs rq->proxy checks]
Signed-off-by: Connor O'Brien
[jstultz: Split out from bigger patch]
Signed-off-by: John Stultz
---
v2:
* Split out into its own patch
v4:
* Remove verbose comments/questions to avoid review distractions, as
  suggested by Dietmar
* Fixed nested blocked_on locking for ww_mutex access
---
 include/linux/sched.h     |  1 +
 init/init_task.c          |  1 +
 kernel/fork.c             |  1 +
 kernel/locking/mutex.c    | 22 ++++++++++++++++++----
 kernel/locking/ww_mutex.h |  6 ++++++
 5 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a312a2ff47bf..6b0d4b398b31 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1142,6 +1142,7 @@ struct task_struct {
 #endif
 
     struct mutex *blocked_on;    /* lock we're blocked on */
+    raw_spinlock_t blocked_lock;
 
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
     int non_block_count;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe6b..189ce67e9704 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -130,6 +130,7 @@ struct task_struct init_task
     .journal_info = NULL,
     INIT_CPU_TIMERS(init_task)
     .pi_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
+    .blocked_lock = __RAW_SPIN_LOCK_UNLOCKED(init_task.blocked_lock),
     .timer_slack_ns = 50000, /* 50 usec default slack */
     .thread_pid = &init_struct_pid,
     .thread_group = LIST_HEAD_INIT(init_task.thread_group),
diff --git a/kernel/fork.c b/kernel/fork.c
index 9244c540bb13..1ea1b2d527bb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2359,6 +2359,7 @@ __latent_entropy struct task_struct *copy_process(
     ftrace_graph_init_task(p);
 
     rt_mutex_init_task(p);
+    raw_spin_lock_init(&p->blocked_lock);
 
     lockdep_assert_irqs_enabled();
 #ifdef CONFIG_PROVE_LOCKING
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index d7a202c35ebe..ac3d2e350fac 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -616,6 +616,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     }
 
     raw_spin_lock_irqsave(&lock->wait_lock, flags);
+    raw_spin_lock(&current->blocked_lock);
     /*
      * After waiting to acquire the wait_lock, try again.
      */
@@ -677,6 +678,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
         goto err;
     }
 
+    raw_spin_unlock(&current->blocked_lock);
     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
     if (ww_ctx)
         ww_ctx_wake(ww_ctx);
@@ -684,6 +686,8 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
 
     first = __mutex_waiter_is_first(lock, &waiter);
 
+    raw_spin_lock_irqsave(&lock->wait_lock, flags);
+    raw_spin_lock(&current->blocked_lock);
     /*
      * Gets reset by ttwu_runnable().
      */
@@ -698,15 +702,23 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
             break;
 
         if (first) {
+            bool acquired;
+
+            /*
+             * mutex_optimistic_spin() can schedule, so we need to
+             * release these locks before calling it.
+             */
+            raw_spin_unlock(&current->blocked_lock);
+            raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
             trace_contention_begin(lock, LCB_F_MUTEX | LCB_F_SPIN);
-            if (mutex_optimistic_spin(lock, ww_ctx, &waiter))
+            acquired = mutex_optimistic_spin(lock, ww_ctx, &waiter);
+            raw_spin_lock_irqsave(&lock->wait_lock, flags);
+            raw_spin_lock(&current->blocked_lock);
+            if (acquired)
                 break;
             trace_contention_begin(lock, LCB_F_MUTEX);
         }
-
-        raw_spin_lock_irqsave(&lock->wait_lock, flags);
     }
-    raw_spin_lock_irqsave(&lock->wait_lock, flags);
 acquired:
     __set_current_state(TASK_RUNNING);
 
@@ -733,6 +745,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     if (ww_ctx)
         ww_mutex_lock_acquired(ww, ww_ctx);
 
+    raw_spin_unlock(&current->blocked_lock);
     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
     if (ww_ctx)
         ww_ctx_wake(ww_ctx);
@@ -745,6 +758,7 @@ __mutex_lock_common(struct mutex *lock, unsigned int state, unsigned int subclass
     __mutex_remove_waiter(lock, &waiter);
 err_early_kill:
     trace_contention_end(lock, ret);
+    raw_spin_unlock(&current->blocked_lock);
     raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
     debug_mutex_free_waiter(&waiter);
     mutex_release(&lock->dep_map, ip);
diff --git a/kernel/locking/ww_mutex.h b/kernel/locking/ww_mutex.h
index 7d623417b496..8378b533bb1e 100644
--- a/kernel/locking/ww_mutex.h
+++ b/kernel/locking/ww_mutex.h
@@ -287,6 +287,8 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
         return false;
 
     if (waiter->ww_ctx->acquired > 0 && __ww_ctx_less(waiter->ww_ctx, ww_ctx)) {
+        /* nested as we should hold current->blocked_lock already */
+        raw_spin_lock_nested(&waiter->task->blocked_lock, SINGLE_DEPTH_NESTING);
 #ifndef WW_RT
         debug_mutex_wake_waiter(lock, waiter);
 #endif
@@ -297,6 +299,7 @@ __ww_mutex_die(struct MUTEX *lock, struct MUTEX_WAITER *waiter,
          * blocked_on relationships that can't resolve.
          */
         waiter->task->blocked_on = NULL;
+        raw_spin_unlock(&waiter->task->blocked_lock);
     }
 
     return true;
@@ -343,6 +346,8 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
      * wakeup pending to re-read the wounded state.
      */
     if (owner != current) {
+        /* nested as we should hold current->blocked_lock already */
+        raw_spin_lock_nested(&owner->blocked_lock, SINGLE_DEPTH_NESTING);
         wake_q_add(&ww_ctx->wake_q, owner);
         /*
          * When waking up the task to wound, be sure to clear the
@@ -350,6 +355,7 @@ static bool __ww_mutex_wound(struct MUTEX *lock,
          * blocked_on relationships that can't resolve.
          */
         owner->blocked_on = NULL;
+        raw_spin_unlock(&owner->blocked_lock);
     }
     return true;
 }
-- 
2.41.0.rc0.172.g3f132b7071-goog
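
[Condensed from the mutex.c hunks above, the resulting lock ordering is:
mutex::wait_lock (irq-safe, outer) -> task_struct::blocked_lock (inner),
with another task's blocked_lock taken SINGLE_DEPTH_NESTING while
current's is already held. The blocked_on update itself looks like:]

      raw_spin_lock_irqsave(&lock->wait_lock, flags);
      raw_spin_lock(&current->blocked_lock);     /* inner lock */
      current->blocked_on = lock;                /* state change serialized */
      raw_spin_unlock(&current->blocked_lock);
      raw_spin_unlock_irqrestore(&lock->wait_lock, flags);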

Date: Thu, 1 Jun 2023 05:58:09 +0000
In-Reply-To: <20230601055846.2349566-1-jstultz@google.com>
Message-ID: <20230601055846.2349566-7-jstultz@google.com>
Subject: [PATCH v4 06/13] locking/mutex: Expose __mutex_owner()
From: John Stultz
To: LKML

From: Juri Lelli

Implementing proxy execution requires that scheduler code be able to
identify the current owner of a mutex. Expose __mutex_owner() for
this purpose (alone!).

Cc: Joel Fernandes
Cc: Qais Yousef
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Valentin Schneider
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Zimuzo Ezeozue
Cc: Youssef Esmat
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Will Deacon
Cc: Waiman Long
Cc: Boqun Feng
Cc: "Paul E . McKenney"
Cc: kernel-team@android.com
Signed-off-by: Juri Lelli
[Removed the EXPORT_SYMBOL]
Signed-off-by: Valentin Schneider
Signed-off-by: Connor O'Brien
[jstultz: Reworked per Peter's suggestions]
Signed-off-by: John Stultz
---
v4:
* Move __mutex_owner() to kernel/locking/mutex.h instead of adding a
  new globally available accessor function, to keep the exposure of
  this low, along with keeping it an inline function, as suggested
  by PeterZ
---
 kernel/locking/mutex.c | 25 -------------------------
 kernel/locking/mutex.h | 25 +++++++++++++++++++++++++
 2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index ac3d2e350fac..8c9f9dffe473 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -56,31 +56,6 @@ __mutex_init(struct mutex *lock, const char *name, struct lock_class_key *key)
 }
 EXPORT_SYMBOL(__mutex_init);
 
-/*
- * @owner: contains: 'struct task_struct *' to the current lock owner,
- * NULL means not owned. Since task_struct pointers are aligned at
- * at least L1_CACHE_BYTES, we have low bits to store extra state.
- *
- * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
- * Bit1 indicates unlock needs to hand the lock to the top-waiter
- * Bit2 indicates handoff has been done and we're waiting for pickup.
- */
-#define MUTEX_FLAG_WAITERS 0x01
-#define MUTEX_FLAG_HANDOFF 0x02
-#define MUTEX_FLAG_PICKUP  0x04
-
-#define MUTEX_FLAGS        0x07
-
-/*
- * Internal helper function; C doesn't allow us to hide it :/
- *
- * DO NOT USE (outside of mutex code).
- */
-static inline struct task_struct *__mutex_owner(struct mutex *lock)
-{
-    return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
-}
-
 static inline struct task_struct *__owner_task(unsigned long owner)
 {
     return (struct task_struct *)(owner & ~MUTEX_FLAGS);
diff --git a/kernel/locking/mutex.h b/kernel/locking/mutex.h
index 0b2a79c4013b..1c7d3d32def8 100644
--- a/kernel/locking/mutex.h
+++ b/kernel/locking/mutex.h
@@ -20,6 +20,31 @@ struct mutex_waiter {
 #endif
 };
 
+/*
+ * @owner: contains: 'struct task_struct *' to the current lock owner,
+ * NULL means not owned. Since task_struct pointers are aligned at
+ * at least L1_CACHE_BYTES, we have low bits to store extra state.
+ *
+ * Bit0 indicates a non-empty waiter list; unlock must issue a wakeup.
+ * Bit1 indicates unlock needs to hand the lock to the top-waiter
+ * Bit2 indicates handoff has been done and we're waiting for pickup.
+ */
+#define MUTEX_FLAG_WAITERS 0x01
+#define MUTEX_FLAG_HANDOFF 0x02
+#define MUTEX_FLAG_PICKUP  0x04
+
+#define MUTEX_FLAGS        0x07
+
+/*
+ * Internal helper function; C doesn't allow us to hide it :/
+ *
+ * DO NOT USE (outside of mutex & scheduler code).
+ */
+static inline struct task_struct *__mutex_owner(struct mutex *lock)
+{
+    return (struct task_struct *)(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
+}
+
 #ifdef CONFIG_DEBUG_MUTEXES
 extern void debug_mutex_lock_common(struct mutex *lock,
                     struct mutex_waiter *waiter);
-- 
2.41.0.rc0.172.g3f132b7071-goog
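
[A sketch of the intended consumer. This is hypothetical: the real users
are the later proxy-execution patches, and the include path is an
assumption, not something shown in this patch.]

  #include "../locking/mutex.h"  /* assumed include path for scheduler code */

  static struct task_struct *blocked_on_owner(struct task_struct *p)
  {
      struct mutex *m = p->blocked_on;   /* from the blocked_on rework */

      /* NULL when not blocked, or when the mutex is currently unowned */
      return m ? __mutex_owner(m) : NULL;
  }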
<20230601055846.2349566-1-jstultz@google.com> Mime-Version: 1.0 References: <20230601055846.2349566-1-jstultz@google.com> X-Mailer: git-send-email 2.41.0.rc0.172.g3f132b7071-goog Message-ID: <20230601055846.2349566-8-jstultz@google.com> Subject: [PATCH v4 07/13] sched: Split scheduler execution context From: John Stultz To: LKML Cc: Peter Zijlstra , Joel Fernandes , Qais Yousef , Ingo Molnar , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Youssef Esmat , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com, "Connor O'Brien" , John Stultz Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra Lets define the scheduling context as all the scheduler state in task_struct and the execution context as all state required to run the task. Currently both are intertwined in task_struct. We want to logically split these such that we can run the execution context of one task with the scheduling context of another. To this purpose introduce rq_selected() macro to point to the task_struct used for scheduler state and preserve rq->curr to denote the execution context. NOTE: Peter previously mentioned he didn't like the name "rq_selected()", but I've not come up with a better alternative. I'm very open to other name proposals. Question for Peter: Dietmar suggested you'd prefer I drop the conditionalization of the scheduler context pointer on the rq (so rq_selected() would be open coded as rq->curr_sched or whatever we agree on for a name), but I'd think in the !CONFIG_PROXY_EXEC case we'd want to avoid the wasted pointer and its use (since it curr_sched would always be =3D=3D curr)? If I'm wrong I'm fine switching this, but would appreciate clarification. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Youssef Esmat Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: Peter Zijlstra (Intel) Signed-off-by: Juri Lelli Signed-off-by: Peter Zijlstra (Intel) Link: https://lkml.kernel.org/r/20181009092434.26221-5-juri.lelli@redhat.com [add additional comments and update more sched_class code to use rq::proxy] Signed-off-by: Connor O'Brien [jstultz: Rebased and resolved minor collisions, reworked to use accessors, tweaked update_curr_common to use rq_proxy fixing rt scheduling issues] Signed-off-by: John Stultz --- v2: * Reworked to use accessors * Fixed update_curr_common to use proxy instead of curr v3: * Tweaked wrapper names * Swapped proxy for selected for clarity v4: * Minor variable name tweaks for readability * Use a macro instead of a inline function and drop other helper functions as suggested by Peter. 
* Remove verbose comments/questions to avoid review distractions, as suggested by Dietmar --- kernel/sched/core.c | 38 +++++++++++++++++++++++++------------- kernel/sched/deadline.c | 35 ++++++++++++++++++----------------- kernel/sched/fair.c | 18 +++++++++--------- kernel/sched/rt.c | 41 ++++++++++++++++++++--------------------- kernel/sched/sched.h | 37 +++++++++++++++++++++++++++++++++++-- 5 files changed, 107 insertions(+), 62 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a68d1276bab0..ace75aadb90b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -793,7 +793,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *time= r) =20 rq_lock(rq, &rf); update_rq_clock(rq); - rq->curr->sched_class->task_tick(rq, rq->curr, 1); + rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1); rq_unlock(rq, &rf); =20 return HRTIMER_NORESTART; @@ -2200,16 +2200,18 @@ static inline void check_class_changed(struct rq *r= q, struct task_struct *p, =20 void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) { - if (p->sched_class =3D=3D rq->curr->sched_class) - rq->curr->sched_class->check_preempt_curr(rq, p, flags); - else if (sched_class_above(p->sched_class, rq->curr->sched_class)) + struct task_struct *curr =3D rq_selected(rq); + + if (p->sched_class =3D=3D curr->sched_class) + curr->sched_class->check_preempt_curr(rq, p, flags); + else if (sched_class_above(p->sched_class, curr->sched_class)) resched_curr(rq); =20 /* * A queue event has occurred, and we're going to schedule. In * this case, we can save a useless back to back clock update. */ - if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr)) + if (task_on_rq_queued(curr) && test_tsk_need_resched(rq->curr)) rq_clock_skip_update(rq); } =20 @@ -2599,7 +2601,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct a= ffinity_context *ctx) lockdep_assert_held(&p->pi_lock); =20 queued =3D task_on_rq_queued(p); - running =3D task_current(rq, p); + running =3D task_current_selected(rq, p); =20 if (queued) { /* @@ -5535,7 +5537,7 @@ unsigned long long task_sched_runtime(struct task_str= uct *p) * project cycles that may never be accounted to this * thread, breaking clock_gettime(). */ - if (task_current(rq, p) && task_on_rq_queued(p)) { + if (task_current_selected(rq, p) && task_on_rq_queued(p)) { prefetch_curr_exec_start(p); update_rq_clock(rq); p->sched_class->update_curr(rq); @@ -5603,7 +5605,8 @@ void scheduler_tick(void) { int cpu =3D smp_processor_id(); struct rq *rq =3D cpu_rq(cpu); - struct task_struct *curr =3D rq->curr; + /* accounting goes to the selected task */ + struct task_struct *curr =3D rq_selected(rq); struct rq_flags rf; unsigned long thermal_pressure; u64 resched_latency; @@ -5701,6 +5704,13 @@ static void sched_tick_remote(struct work_struct *wo= rk) if (cpu_is_offline(cpu)) goto out_unlock; =20 + /* + * Since this is a remote tick for full dynticks mode, we are + * always sure that there is no proxy (only a single task is + * running). 
+ */ + SCHED_WARN_ON(rq->curr !=3D rq_selected(rq)); + update_rq_clock(rq); =20 if (!is_idle_task(curr)) { @@ -6631,6 +6641,7 @@ static void __sched notrace __schedule(unsigned int s= ched_mode) } =20 next =3D pick_next_task(rq, prev, &rf); + rq_set_selected(rq, next); clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG @@ -7097,7 +7108,7 @@ void rt_mutex_setprio(struct task_struct *p, struct t= ask_struct *pi_task) =20 prev_class =3D p->sched_class; queued =3D task_on_rq_queued(p); - running =3D task_current(rq, p); + running =3D task_current_selected(rq, p); if (queued) dequeue_task(rq, p, queue_flag); if (running) @@ -7185,7 +7196,7 @@ void set_user_nice(struct task_struct *p, long nice) goto out_unlock; } queued =3D task_on_rq_queued(p); - running =3D task_current(rq, p); + running =3D task_current_selected(rq, p); if (queued) dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK); if (running) @@ -7749,7 +7760,7 @@ static int __sched_setscheduler(struct task_struct *p, } =20 queued =3D task_on_rq_queued(p); - running =3D task_current(rq, p); + running =3D task_current_selected(rq, p); if (queued) dequeue_task(rq, p, queue_flags); if (running) @@ -9249,6 +9260,7 @@ void __init init_idle(struct task_struct *idle, int c= pu) rcu_read_unlock(); =20 rq->idle =3D idle; + rq_set_selected(rq, idle); rcu_assign_pointer(rq->curr, idle); idle->on_rq =3D TASK_ON_RQ_QUEUED; #ifdef CONFIG_SMP @@ -9351,7 +9363,7 @@ void sched_setnuma(struct task_struct *p, int nid) =20 rq =3D task_rq_lock(p, &rf); queued =3D task_on_rq_queued(p); - running =3D task_current(rq, p); + running =3D task_current_selected(rq, p); =20 if (queued) dequeue_task(rq, p, DEQUEUE_SAVE); @@ -10478,7 +10490,7 @@ void sched_move_task(struct task_struct *tsk) =20 update_rq_clock(rq); =20 - running =3D task_current(rq, tsk); + running =3D task_current_selected(rq, tsk); queued =3D task_on_rq_queued(tsk); =20 if (queued) diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index f6f746d52410..d41d562df078 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1179,7 +1179,7 @@ static enum hrtimer_restart dl_task_timer(struct hrti= mer *timer) #endif =20 enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); - if (dl_task(rq->curr)) + if (dl_task(rq_selected(rq))) check_preempt_curr_dl(rq, p, 0); else resched_curr(rq); @@ -1306,7 +1306,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, str= uct sched_dl_entity *dl_se) */ static void update_curr_dl(struct rq *rq) { - struct task_struct *curr =3D rq->curr; + struct task_struct *curr =3D rq_selected(rq); struct sched_dl_entity *dl_se =3D &curr->dl; s64 delta_exec, scaled_delta_exec; int cpu =3D cpu_of(rq); @@ -1819,7 +1819,7 @@ static int find_later_rq(struct task_struct *task); static int select_task_rq_dl(struct task_struct *p, int cpu, int flags) { - struct task_struct *curr; + struct task_struct *curr, *selected; bool select_rq; struct rq *rq; =20 @@ -1830,6 +1830,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int= flags) =20 rcu_read_lock(); curr =3D READ_ONCE(rq->curr); /* unlocked access */ + selected =3D READ_ONCE(rq_selected(rq)); =20 /* * If we are dealing with a -deadline task, we must @@ -1840,9 +1841,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int= flags) * other hand, if it has a shorter deadline, we * try to make it stay here, it might be important. 
*/ - select_rq =3D unlikely(dl_task(curr)) && + select_rq =3D unlikely(dl_task(selected)) && (curr->nr_cpus_allowed < 2 || - !dl_entity_preempt(&p->dl, &curr->dl)) && + !dl_entity_preempt(&p->dl, &selected->dl)) && p->nr_cpus_allowed > 1; =20 /* @@ -1905,7 +1906,7 @@ static void check_preempt_equal_dl(struct rq *rq, str= uct task_struct *p) * let's hope p can move out. */ if (rq->curr->nr_cpus_allowed =3D=3D 1 || - !cpudl_find(&rq->rd->cpudl, rq->curr, NULL)) + !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL)) return; =20 /* @@ -1944,7 +1945,7 @@ static int balance_dl(struct rq *rq, struct task_stru= ct *p, struct rq_flags *rf) static void check_preempt_curr_dl(struct rq *rq, struct task_struct *p, int flags) { - if (dl_entity_preempt(&p->dl, &rq->curr->dl)) { + if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) { resched_curr(rq); return; } @@ -1954,7 +1955,7 @@ static void check_preempt_curr_dl(struct rq *rq, stru= ct task_struct *p, * In the unlikely case current and p have the same deadline * let us try to decide what's the best thing to do... */ - if ((p->dl.deadline =3D=3D rq->curr->dl.deadline) && + if ((p->dl.deadline =3D=3D rq_selected(rq)->dl.deadline) && !test_tsk_need_resched(rq->curr)) check_preempt_equal_dl(rq, p); #endif /* CONFIG_SMP */ @@ -1989,7 +1990,7 @@ static void set_next_task_dl(struct rq *rq, struct ta= sk_struct *p, bool first) if (hrtick_enabled_dl(rq)) start_hrtick_dl(rq, p); =20 - if (rq->curr->sched_class !=3D &dl_sched_class) + if (rq_selected(rq)->sched_class !=3D &dl_sched_class) update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0); =20 deadline_queue_push_tasks(rq); @@ -2306,8 +2307,8 @@ static int push_dl_task(struct rq *rq) * can move away, it makes sense to just reschedule * without going further in pushing next_task. */ - if (dl_task(rq->curr) && - dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) && + if (dl_task(rq_selected(rq)) && + dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) = && rq->curr->nr_cpus_allowed > 1) { resched_curr(rq); return 0; @@ -2432,7 +2433,7 @@ static void pull_dl_task(struct rq *this_rq) * deadline than the current task of its runqueue. 
*/ if (dl_time_before(p->dl.deadline, - src_rq->curr->dl.deadline)) + rq_selected(src_rq)->dl.deadline)) goto skip; =20 if (is_migration_disabled(p)) { @@ -2471,9 +2472,9 @@ static void task_woken_dl(struct rq *rq, struct task_= struct *p) if (!task_on_cpu(rq, p) && !test_tsk_need_resched(rq->curr) && p->nr_cpus_allowed > 1 && - dl_task(rq->curr) && + dl_task(rq_selected(rq)) && (rq->curr->nr_cpus_allowed < 2 || - !dl_entity_preempt(&p->dl, &rq->curr->dl))) { + !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) { push_dl_tasks(rq); } } @@ -2636,12 +2637,12 @@ static void switched_to_dl(struct rq *rq, struct ta= sk_struct *p) return; } =20 - if (rq->curr !=3D p) { + if (rq_selected(rq) !=3D p) { #ifdef CONFIG_SMP if (p->nr_cpus_allowed > 1 && rq->dl.overloaded) deadline_queue_push_tasks(rq); #endif - if (dl_task(rq->curr)) + if (dl_task(rq_selected(rq))) check_preempt_curr_dl(rq, p, 0); else resched_curr(rq); @@ -2670,7 +2671,7 @@ static void prio_changed_dl(struct rq *rq, struct tas= k_struct *p, if (!rq->dl.overloaded) deadline_queue_pull_task(rq); =20 - if (task_current(rq, p)) { + if (task_current_selected(rq, p)) { /* * If we now have a earlier deadline task than p, * then reschedule, provided p is still on this diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bf9e8f29398e..62c3c1762004 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -919,7 +919,7 @@ static s64 update_curr_se(struct rq *rq, struct sched_e= ntity *curr) */ s64 update_curr_common(struct rq *rq) { - struct task_struct *curr =3D rq->curr; + struct task_struct *curr =3D rq_selected(rq); s64 delta_exec; =20 delta_exec =3D update_curr_se(rq, &curr->se); @@ -964,7 +964,7 @@ static void update_curr(struct cfs_rq *cfs_rq) =20 static void update_curr_fair(struct rq *rq) { - update_curr(cfs_rq_of(&rq->curr->se)); + update_curr(cfs_rq_of(&rq_selected(rq)->se)); } =20 static inline void @@ -6230,7 +6230,7 @@ static void hrtick_start_fair(struct rq *rq, struct t= ask_struct *p) s64 delta =3D slice - ran; =20 if (delta < 0) { - if (task_current(rq, p)) + if (task_current_selected(rq, p)) resched_curr(rq); return; } @@ -6245,7 +6245,7 @@ static void hrtick_start_fair(struct rq *rq, struct t= ask_struct *p) */ static void hrtick_update(struct rq *rq) { - struct task_struct *curr =3D rq->curr; + struct task_struct *curr =3D rq_selected(rq); =20 if (!hrtick_enabled_fair(rq) || curr->sched_class !=3D &fair_sched_class) return; @@ -7882,7 +7882,7 @@ static void set_skip_buddy(struct sched_entity *se) */ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int= wake_flags) { - struct task_struct *curr =3D rq->curr; + struct task_struct *curr =3D rq_selected(rq); struct sched_entity *se =3D &curr->se, *pse =3D &p->se; struct cfs_rq *cfs_rq =3D task_cfs_rq(curr); int scale =3D cfs_rq->nr_running >=3D sched_nr_latency; @@ -7916,7 +7916,7 @@ static void check_preempt_wakeup(struct rq *rq, struc= t task_struct *p, int wake_ * prevents us from potentially nominating it as a false LAST_BUDDY * below. */ - if (test_tsk_need_resched(curr)) + if (test_tsk_need_resched(rq->curr)) return; =20 /* Idle tasks are by definition preempted by non-idle tasks. */ @@ -8915,7 +8915,7 @@ static bool __update_blocked_others(struct rq *rq, bo= ol *done) * update_load_avg() can call cpufreq_update_util(). Make sure that RT, * DL and IRQ signals have been updated before updating CFS. 
*/ - curr_class =3D rq->curr->sched_class; + curr_class =3D rq_selected(rq)->sched_class; =20 thermal_pressure =3D arch_scale_thermal_pressure(cpu_of(rq)); =20 @@ -12162,7 +12162,7 @@ prio_changed_fair(struct rq *rq, struct task_struct= *p, int oldprio) * our priority decreased, or if we are not currently running on * this runqueue and our priority is higher than the current's */ - if (task_current(rq, p)) { + if (task_current_selected(rq, p)) { if (p->prio > oldprio) resched_curr(rq); } else @@ -12307,7 +12307,7 @@ static void switched_to_fair(struct rq *rq, struct = task_struct *p) * kick off the schedule if running, otherwise just see * if we can still preempt the current task. */ - if (task_current(rq, p)) + if (task_current_selected(rq, p)) resched_curr(rq); else check_preempt_curr(rq, p, 0); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 0d0b276c447d..3ba24c3fce20 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -574,7 +574,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *r= t_se, unsigned int flags) =20 static void sched_rt_rq_enqueue(struct rt_rq *rt_rq) { - struct task_struct *curr =3D rq_of_rt_rq(rt_rq)->curr; + struct task_struct *curr =3D rq_selected(rq_of_rt_rq(rt_rq)); struct rq *rq =3D rq_of_rt_rq(rt_rq); struct sched_rt_entity *rt_se; =20 @@ -1044,7 +1044,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt= _rq) */ static void update_curr_rt(struct rq *rq) { - struct task_struct *curr =3D rq->curr; + struct task_struct *curr =3D rq_selected(rq); struct sched_rt_entity *rt_se =3D &curr->rt; s64 delta_exec; =20 @@ -1591,7 +1591,7 @@ static int find_lowest_rq(struct task_struct *task); static int select_task_rq_rt(struct task_struct *p, int cpu, int flags) { - struct task_struct *curr; + struct task_struct *curr, *selected; struct rq *rq; bool test; =20 @@ -1603,6 +1603,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int= flags) =20 rcu_read_lock(); curr =3D READ_ONCE(rq->curr); /* unlocked access */ + selected =3D READ_ONCE(rq_selected(rq)); =20 /* * If the current task on @p's runqueue is an RT task, then @@ -1631,8 +1632,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int= flags) * systems like big.LITTLE. */ test =3D curr && - unlikely(rt_task(curr)) && - (curr->nr_cpus_allowed < 2 || curr->prio <=3D p->prio); + unlikely(rt_task(selected)) && + (curr->nr_cpus_allowed < 2 || selected->prio <=3D p->prio); =20 if (test || !rt_task_fits_capacity(p, cpu)) { int target =3D find_lowest_rq(p); @@ -1662,12 +1663,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, in= t flags) =20 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) { - /* - * Current can't be migrated, useless to reschedule, - * let's hope p can move out. - */ if (rq->curr->nr_cpus_allowed =3D=3D 1 || - !cpupri_find(&rq->rd->cpupri, rq->curr, NULL)) + !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL)) return; =20 /* @@ -1710,7 +1707,9 @@ static int balance_rt(struct rq *rq, struct task_stru= ct *p, struct rq_flags *rf) */ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, in= t flags) { - if (p->prio < rq->curr->prio) { + struct task_struct *curr =3D rq_selected(rq); + + if (p->prio < curr->prio) { resched_curr(rq); return; } @@ -1728,7 +1727,7 @@ static void check_preempt_curr_rt(struct rq *rq, stru= ct task_struct *p, int flag * to move current somewhere else, making room for our non-migratable * task. 
*/ - if (p->prio =3D=3D rq->curr->prio && !test_tsk_need_resched(rq->curr)) + if (p->prio =3D=3D curr->prio && !test_tsk_need_resched(rq->curr)) check_preempt_equal_prio(rq, p); #endif } @@ -1753,7 +1752,7 @@ static inline void set_next_task_rt(struct rq *rq, st= ruct task_struct *p, bool f * utilization. We only care of the case where we start to schedule a * rt task */ - if (rq->curr->sched_class !=3D &rt_sched_class) + if (rq_selected(rq)->sched_class !=3D &rt_sched_class) update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0); =20 rt_queue_push_tasks(rq); @@ -2033,7 +2032,7 @@ static struct task_struct *pick_next_pushable_task(st= ruct rq *rq) struct task_struct, pushable_tasks); =20 BUG_ON(rq->cpu !=3D task_cpu(p)); - BUG_ON(task_current(rq, p)); + BUG_ON(task_current(rq, p) || task_current_selected(rq, p)); BUG_ON(p->nr_cpus_allowed <=3D 1); =20 BUG_ON(!task_on_rq_queued(p)); @@ -2066,7 +2065,7 @@ static int push_rt_task(struct rq *rq, bool pull) * higher priority than current. If that's the case * just reschedule current. */ - if (unlikely(next_task->prio < rq->curr->prio)) { + if (unlikely(next_task->prio < rq_selected(rq)->prio)) { resched_curr(rq); return 0; } @@ -2419,7 +2418,7 @@ static void pull_rt_task(struct rq *this_rq) * p if it is lower in priority than the * current task on the run queue */ - if (p->prio < src_rq->curr->prio) + if (p->prio < rq_selected(src_rq)->prio) goto skip; =20 if (is_migration_disabled(p)) { @@ -2461,9 +2460,9 @@ static void task_woken_rt(struct rq *rq, struct task_= struct *p) bool need_to_push =3D !task_on_cpu(rq, p) && !test_tsk_need_resched(rq->curr) && p->nr_cpus_allowed > 1 && - (dl_task(rq->curr) || rt_task(rq->curr)) && + (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) && (rq->curr->nr_cpus_allowed < 2 || - rq->curr->prio <=3D p->prio); + rq_selected(rq)->prio <=3D p->prio); =20 if (need_to_push) push_rt_tasks(rq); @@ -2547,7 +2546,7 @@ static void switched_to_rt(struct rq *rq, struct task= _struct *p) if (p->nr_cpus_allowed > 1 && rq->rt.overloaded) rt_queue_push_tasks(rq); #endif /* CONFIG_SMP */ - if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq))) + if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq))) resched_curr(rq); } } @@ -2562,7 +2561,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p,= int oldprio) if (!task_on_rq_queued(p)) return; =20 - if (task_current(rq, p)) { + if (task_current_selected(rq, p)) { #ifdef CONFIG_SMP /* * If our priority decreases while running, we @@ -2588,7 +2587,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p,= int oldprio) * greater than the current running task * then reschedule. 
*/ - if (p->prio < rq->curr->prio) + if (p->prio < rq_selected(rq)->prio) resched_curr(rq); } } diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 4a1ef64449b2..29597a6fd65b 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1008,7 +1008,10 @@ struct rq { */ unsigned int nr_uninterruptible; =20 - struct task_struct __rcu *curr; + struct task_struct __rcu *curr; /* Execution context */ +#ifdef CONFIG_PROXY_EXEC + struct task_struct __rcu *curr_sched; /* Scheduling context (policy) */ +#endif struct task_struct *idle; struct task_struct *stop; unsigned long next_balance; @@ -1207,6 +1210,22 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); #define cpu_curr(cpu) (cpu_rq(cpu)->curr) #define raw_rq() raw_cpu_ptr(&runqueues) =20 +#ifdef CONFIG_PROXY_EXEC +#define rq_selected(rq) ((rq)->curr_sched) +#define cpu_curr_selected(cpu) (cpu_rq(cpu)->curr_sched) +static inline void rq_set_selected(struct rq *rq, struct task_struct *t) +{ + rcu_assign_pointer(rq->curr_sched, t); +} +#else +#define rq_selected(rq) ((rq)->curr) +#define cpu_curr_selected(cpu) (cpu_rq(cpu)->curr) +static inline void rq_set_selected(struct rq *rq, struct task_struct *t) +{ + /* Do nothing */ +} +#endif + struct sched_group; #ifdef CONFIG_SCHED_CORE static inline struct cpumask *sched_group_span(struct sched_group *sg); @@ -2068,11 +2087,25 @@ static inline u64 global_rt_runtime(void) return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC; } =20 +/* + * Is p the current execution context? + */ static inline int task_current(struct rq *rq, struct task_struct *p) { return rq->curr =3D=3D p; } =20 +/* + * Is p the current scheduling context? + * + * Note that it might be the current execution context at the same time if + * rq->curr =3D=3D rq_selected() =3D=3D p. 
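+ *
+ * Rough usage sketch, condensed from the call-site conversions earlier in
+ * this patch (not additional new code): places that used to test
+ * task_current(rq, p) before rescheduling the policy side now do:
+ *
+ *	if (task_current_selected(rq, p))
+ *		resched_curr(rq);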
+ */ +static inline int task_current_selected(struct rq *rq, struct task_struct = *p) +{ + return rq_selected(rq) =3D=3D p; +} + static inline int task_on_cpu(struct rq *rq, struct task_struct *p) { #ifdef CONFIG_SMP @@ -2234,7 +2267,7 @@ struct sched_class { =20 static inline void put_prev_task(struct rq *rq, struct task_struct *prev) { - WARN_ON_ONCE(rq->curr !=3D prev); + WARN_ON_ONCE(rq_selected(rq) !=3D prev); prev->sched_class->put_prev_task(rq, prev); } =20 --=20 2.41.0.rc0.172.g3f132b7071-goog From nobody Sat Feb 7 18:15:59 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D6CEDC77B7E for ; Thu, 1 Jun 2023 06:00:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231734AbjFAGAA (ORCPT ); Thu, 1 Jun 2023 02:00:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33488 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231710AbjFAF70 (ORCPT ); Thu, 1 Jun 2023 01:59:26 -0400 Received: from mail-yw1-x1149.google.com (mail-yw1-x1149.google.com [IPv6:2607:f8b0:4864:20::1149]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EE57F1AB for ; Wed, 31 May 2023 22:59:10 -0700 (PDT) Received: by mail-yw1-x1149.google.com with SMTP id 00721157ae682-5659c7dad06so7674347b3.0 for ; Wed, 31 May 2023 22:59:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1685599150; x=1688191150; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=dZ6JJUxxg/evvLbMiOXiQLvIE2NeWkgAOuvpFCCCZeI=; b=nFzjfejdNuWfdbeW4COw8TITHM2+jzSeY30W6HIzeIEw0KdLMx6fFyQFq8IVguNPLE sPRZrxLlxrcuAkAn1vOhW4h19JugrwPcwiK09MKVkdGSDKX6+5riWbw+VMIk9JYqlJoh EWjQLwxPIYklx5Bd7nRr8pzjpd0lZVcph9z3qrvCzH3frC6vLVnBSXl2d4uMGnrnSEr0 0lfXsKiXUSM0M4F3tqVNqhQIg23UqzLOGE4jfjQpY/22dJDgcwOs2QAdIu8UL/BsUHw7 gLyl5SMnIg27SOZUO2XgQR1SvSq+WQ3LJUpZ/7jDB3n3UJC9wa0ijIm72rstIVv2ucAT 1Cvg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1685599150; x=1688191150; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=dZ6JJUxxg/evvLbMiOXiQLvIE2NeWkgAOuvpFCCCZeI=; b=e5veZnNCV+71lZOR4ssDZf7wHjvQDVpR7B2gHS/Gu+e4CLIh+bqJqO4pIPXkqJ/LcA 9GuKdvI4c8IxCKpJYjF7MiALAwNkdQc8HsqYbs1TskILoCy3WOvG748FDEJToJO82jGx 1oAHqh7ftobGQSjLAx0021Ga9kLdpTPkUVOrExPTXH/UsBmppV0F214zwFqumTt9kvW/ s78VWTKivY9+LhF/E+XDDdCDKdskZ7RmvLc7dWcTq/NPKbSS+V3BfHsyN9awoh8K/PvU yiHR3pBNH8xmki32DTAqqDsfYxjdyj/6bvhNfomjU6uLPKKQy/iw5lk37swWc83z2DiY NquA== X-Gm-Message-State: AC+VfDwRlhOw0AHUQ09YeINYLBhtcaM/qDSgeegC888s93/ytydTgd1J 6uCwW4tpTCH4OKXbiK+wPDyvaSwxw7ECp255ndYZ0F1GOdPrkwQWGQbnPDaYjKGZKrjGGAY7Mes 4P/ZVX9a7KgPXPpSnrLktkzrw8UOi8ckud2jhWrRQFH1gzXrX+i0iljyvLa2FVYcoZ9QL73I= X-Google-Smtp-Source: ACHHUZ55SvZRPwcIQBjeAyn4cjvyPhWUBLUQTmafrND5OdDeR79g7o65gDgAOKUv95QJXsS3h6RWKeseG75S X-Received: from jstultz-noogler2.c.googlers.com ([fda3:e722:ac3:cc00:24:72f4:c0a8:600]) (user=jstultz job=sendgmr) by 2002:a81:ad5e:0:b0:565:b765:3fb with SMTP id l30-20020a81ad5e000000b00565b76503fbmr4718726ywk.9.1685599150121; Wed, 31 May 2023 22:59:10 -0700 (PDT) Date: Thu, 1 Jun 2023 05:58:11 +0000 In-Reply-To: <20230601055846.2349566-1-jstultz@google.com> Mime-Version: 1.0 References: 
<20230601055846.2349566-1-jstultz@google.com> X-Mailer: git-send-email 2.41.0.rc0.172.g3f132b7071-goog Message-ID: <20230601055846.2349566-9-jstultz@google.com> Subject: [PATCH v4 08/13] sched: Unnest ttwu_runnable in prep for proxy-execution From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Youssef Esmat , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Slightly rework ttwu_runnable to minimize the nesting to help make the proxy-execution changes easier to read. Should be no logical change here. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Youssef Esmat Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: John Stultz --- kernel/sched/core.c | 24 +++++++++++++----------- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ace75aadb90b..3dce69feb934 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3799,18 +3799,20 @@ static int ttwu_runnable(struct task_struct *p, int= wake_flags) int ret =3D 0; =20 rq =3D __task_rq_lock(p, &rf); - if (task_on_rq_queued(p)) { - if (!task_on_cpu(rq, p)) { - /* - * When on_rq && !on_cpu the task is preempted, see if - * it should preempt the task that is current now. - */ - update_rq_clock(rq); - check_preempt_curr(rq, p, wake_flags); - } - ttwu_do_wakeup(p); - ret =3D 1; + if (!task_on_rq_queued(p)) + goto out_unlock; + + if (!task_on_cpu(rq, p)) { + /* + * When on_rq && !on_cpu the task is preempted, see if + * it should preempt the task that is current now. 
+ */ + update_rq_clock(rq); + check_preempt_curr(rq, p, wake_flags); } + ttwu_do_wakeup(p); + ret =3D 1; +out_unlock: __task_rq_unlock(rq, &rf); =20 return ret; --=20 2.41.0.rc0.172.g3f132b7071-goog From nobody Sat Feb 7 18:15:59 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 77AB7C77B7E for ; Thu, 1 Jun 2023 06:00:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231822AbjFAGAK (ORCPT ); Thu, 1 Jun 2023 02:00:10 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33458 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231629AbjFAF71 (ORCPT ); Thu, 1 Jun 2023 01:59:27 -0400 Received: from mail-pj1-x1049.google.com (mail-pj1-x1049.google.com [IPv6:2607:f8b0:4864:20::1049]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C935D1B0 for ; Wed, 31 May 2023 22:59:12 -0700 (PDT) Received: by mail-pj1-x1049.google.com with SMTP id 98e67ed59e1d1-256563a2097so66872a91.0 for ; Wed, 31 May 2023 22:59:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1685599152; x=1688191152; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=xZN524CJUN5e0Pirf1W/KYPONg/g1WvIzNM3xlIxz/U=; b=bSYfIXe7m/IE0UNiNuUBMzCfB5MDvm8b1ebK+0UEs8877FvniZ2NqYD9JCqUQ36qET Vqh9PMxRsyrXIYDSuaeeTKtOcYcWlajLfXYBvQ6gnFjP4gUaJ4xpVbXvvscm+8GoUyCZ R8LW0VmJ2GRx7EOMj5QMvQn4c+aZ0+yxWN+mI8JOskmcij0uGPbQlYuzomSzcnoZrE4l iEdX82pARhBO5bvfF6vj7yHVRPl0nS29WayEAIBd9oj8rtaGENQUoPIbPuuRx0nOK/nq ityxi9AtWiM4guyXdgOJ1HYZY34Ld0i2yVBDIFBeF2/0CUC4osCrEXO37ytgi6iQ+FQL 6IlQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1685599152; x=1688191152; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=xZN524CJUN5e0Pirf1W/KYPONg/g1WvIzNM3xlIxz/U=; b=WA+q9d/2Yb/UGlciuQJiSi2FpwIuLb/xvOk/HkJiGvj1elBEYGZ787jQBeV9iR0rqG ennJ7xVS8y1lYuSw7KXJ8B5a8GTgzf+LuCnlpQONvT82xZNk6TlZRy7Z3AIpPiBo2gg0 G5Ov0vjoHwPL7VekG3ikoenm/ouGQ0XnCoKWU0UJIENI+QLQqPQfvX5CYRi6H3rgqVsW Bq/6lMpjlwtKWog1XO8MDf0pWuOshhZAHlwVm04yVKji0v5PV4DxQ+UqQPDeNroICGxt 0Qchmq0m5Psg4/oPO3iF+TxMH8MHG7Y5UsgBmID+ekK0FBKhICpdoxtwQw9XoXrw3c/+ /Egg== X-Gm-Message-State: AC+VfDwyxiD8CgXWtZR4gL6/YWMgeAiBV/Q8zfSm6x/W64O5WDZ4c1zv MggtyBmminspEAqLiAexS+TzisbN6MNTjG12+kw8bYOPwFJmo2uLKP5s6xuw5P3EnMVZi+7Tkqm Mzw/OrgqfAxN85zDHY1x/WFZROq+0FtETtWIcJv8hka0N0DAhi65PvZrU/7qlIMR5RCC2gww= X-Google-Smtp-Source: ACHHUZ6d3Qe+CEnMzvuwJJSVIB+/3eCiXc9f+aRgBuS3eSm/+5LXmq1FxELSNekU3xdrf15r0yayt3Muwg8Q X-Received: from jstultz-noogler2.c.googlers.com ([fda3:e722:ac3:cc00:24:72f4:c0a8:600]) (user=jstultz job=sendgmr) by 2002:a17:903:449:b0:1b0:410e:906f with SMTP id iw9-20020a170903044900b001b0410e906fmr1811508plb.0.1685599152064; Wed, 31 May 2023 22:59:12 -0700 (PDT) Date: Thu, 1 Jun 2023 05:58:12 +0000 In-Reply-To: <20230601055846.2349566-1-jstultz@google.com> Mime-Version: 1.0 References: <20230601055846.2349566-1-jstultz@google.com> X-Mailer: git-send-email 2.41.0.rc0.172.g3f132b7071-goog Message-ID: <20230601055846.2349566-10-jstultz@google.com> Subject: [PATCH v4 09/13] sched: Add proxy execution From: John Stultz To: LKML Cc: Peter Zijlstra , Joel Fernandes , Qais Yousef , Ingo Molnar , Juri Lelli , Vincent Guittot , 
Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Youssef Esmat , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com, Valentin Schneider , "Connor O'Brien" , John Stultz Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Zijlstra The classic solution to priority-inversion is priority- inheritance, which the kernel supports via the rt_mutex. However, priority-inheritance only really works for interactions between RT tasks and lower-priority tasks (RT or OTHER), as it utilizes RT's strict priority ordering. With CFS and DEADLINE classes, the next task chosen by the scheduler does not use a linear priority ordering. So a more general solution is needed. Proxy Execution provides just that: It allows mutex owner to be run using the entire scheduling-context of tasks that are blocked waiting on that mutex. The basic mechanism is implemented by this patch, the core of which resides in the proxy() function. Tasks blocked on mutexes are not dequeued, so, if one of them is selected by schedule() as the next task to be run on a CPU, proxy() is used to walk the blocked_on relation and find a proxy task (a lock owner) to run on the lock-waiters behalf (utilizing the lock-waiters scheduling context). This can be thought of as similar to rt_mutex priority- inheritance, but utilizes the scheduler's pick_next_task() function to determine the most important task to run next, (from the set of runnable *and* mutex blocked tasks) rather then a integer priority value. Then the proxy() function finds a dependent lock owner to run, effecively boosting it by running with the selected tasks scheduler context. Here come the tricky bits. In fact, the owner task might be in all sort of states when a proxy is found (blocked, executing on a different CPU, etc.). Details on how to handle different situations are to be found in proxy() code comments. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Youssef Esmat Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: Peter Zijlstra (Intel) [rebased, added comments and changelog] Signed-off-by: Juri Lelli [Fixed rebase conflicts] [squashed sched: Ensure blocked_on is always guarded by blocked_lock] Signed-off-by: Valentin Schneider [fix rebase conflicts, various fixes & tweaks commented inline] [squashed sched: Use rq->curr vs rq->proxy checks] Signed-off-by: Connor O'Brien [jstultz: Rebased, split up, and folded in changes from Juri Lelli and Connor O'Brian, added additional locking on get_task_blocked_on(next) logic, pretty major rework to better conditionalize logic on CONFIG_PROXY_EXEC and split up the very large proxy() function - hopefully without changes to logic / behavior] Signed-off-by: John Stultz --- v2: * Numerous changes folded in * Split out some of the logic into separate patches * Break up the proxy() function so its a bit easier to read and is better conditionalized on CONFIG_PROXY_EXEC v3: * Improve comments * Added fix to call __balance_callbacks before we call pick_next_task() again, as a callback may have been set causing rq_pin_lock to generate warnings. 
* Added fix to call __balance_callbacks before we drop the rq lock in proxy_migrate_task, to avoid rq_pin_lock from generating warnings if a callback was set v4: * Rename blocked_proxy -> blocked_donor to clarify relationship * Fix null ptr deref at end of proxy() * Fix null ptr deref in ttwu_proxy_skip_wakeup() path * Remove verbose comments/questions to avoid review distractions, as suggested by Dietmar * Reword and expand commit message to provide more detailed context on how the idea works. * Minor rebase for moving *_task_blocked_on() wrappers to be a later add on to the main patch series. TODO: Finish conditionalization edge cases --- include/linux/sched.h | 2 + init/Kconfig | 7 + kernel/Kconfig.locks | 2 +- kernel/fork.c | 2 + kernel/locking/mutex.c | 37 ++- kernel/sched/core.c | 525 +++++++++++++++++++++++++++++++++++++++- kernel/sched/deadline.c | 2 +- kernel/sched/fair.c | 9 +- kernel/sched/rt.c | 3 +- kernel/sched/sched.h | 20 +- 10 files changed, 594 insertions(+), 15 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 6b0d4b398b31..8ac9db6ca747 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1141,7 +1141,9 @@ struct task_struct { struct rt_mutex_waiter *pi_blocked_on; #endif =20 + struct task_struct *blocked_donor; /* task that is boosting us */ struct mutex *blocked_on; /* lock we're blocked on */ + struct list_head blocked_entry; /* tasks blocked on us */ raw_spinlock_t blocked_lock; =20 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP diff --git a/init/Kconfig b/init/Kconfig index 32c24950c4ce..43abaffc7dfa 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -907,6 +907,13 @@ config NUMA_BALANCING_DEFAULT_ENABLED If set, automatic NUMA balancing will be enabled if running on a NUMA machine. =20 +config PROXY_EXEC + bool "Proxy Execution" + default n + help + This option enables proxy execution, a mechanism for mutex owning + tasks to inherit the scheduling context of higher priority waiters. + menuconfig CGROUPS bool "Control Group support" select KERNFS diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index 4198f0273ecd..791c98f1d329 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -226,7 +226,7 @@ config ARCH_SUPPORTS_ATOMIC_RMW =20 config MUTEX_SPIN_ON_OWNER def_bool y - depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW + depends on SMP && ARCH_SUPPORTS_ATOMIC_RMW && !PROXY_EXEC =20 config RWSEM_SPIN_ON_OWNER def_bool y diff --git a/kernel/fork.c b/kernel/fork.c index 1ea1b2d527bb..2451eb8bcfe7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2462,7 +2462,9 @@ __latent_entropy struct task_struct *copy_process( lockdep_init_task(p); #endif =20 + p->blocked_donor =3D NULL; /* nobody is boosting us yet */ p->blocked_on =3D NULL; /* not blocked yet */ + INIT_LIST_HEAD(&p->blocked_entry); =20 #ifdef CONFIG_BCACHE p->sequential_io =3D 0; diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c index 8c9f9dffe473..eabfd66ce224 100644 --- a/kernel/locking/mutex.c +++ b/kernel/locking/mutex.c @@ -905,11 +905,13 @@ static noinline void __sched __mutex_unlock_slowpath(= struct mutex *lock, unsigne { struct task_struct *next =3D NULL; DEFINE_WAKE_Q(wake_q); - unsigned long owner; + /* Always force HANDOFF for Proxy Exec for now. Revisit. */ + unsigned long owner =3D MUTEX_FLAG_HANDOFF; unsigned long flags; =20 mutex_release(&lock->dep_map, ip); =20 +#ifndef CONFIG_PROXY_EXEC /* * Release the lock before (potentially) taking the spinlock such that * other contenders can get on with things ASAP. 
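As a rough illustration of the inversion this series addresses (a sketch based on the commit message above, not code from the patch; the task names are made up):

	/* Task L (low priority) and task H (considered more important by
	 * pick_next_task(), whether RT, DL or CFS) contend on mutex M. */
	L: mutex_lock(M);	/* L becomes the owner of M */
	H: mutex_lock(M);	/* H blocks, H->blocked_on =3D M, H stays enqueued */
	/* If the scheduler selects H, proxy() walks H->blocked_on -> M ->
	 * owner L and runs L using H's scheduling context, recording
	 * L->blocked_donor =3D H so that __mutex_unlock_slowpath() below can
	 * hand M directly to H once L releases it. */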
@@ -932,10 +934,38 @@ static noinline void __sched __mutex_unlock_slowpath(= struct mutex *lock, unsigne return; } } +#endif =20 raw_spin_lock_irqsave(&lock->wait_lock, flags); debug_mutex_unlock(lock); - if (!list_empty(&lock->wait_list)) { + +#ifdef CONFIG_PROXY_EXEC + raw_spin_lock(¤t->blocked_lock); + /* + * If we have a task boosting us, and that task was boosting us through + * this lock, hand the lock to that task, as that is the highest + * waiter, as selected by the scheduling function. + */ + next =3D current->blocked_donor; + if (next) { + struct mutex *next_lock; + + raw_spin_lock_nested(&next->blocked_lock, SINGLE_DEPTH_NESTING); + next_lock =3D next->blocked_on; + raw_spin_unlock(&next->blocked_lock); + if (next_lock !=3D lock) { + next =3D NULL; + } else { + wake_q_add(&wake_q, next); + current->blocked_donor =3D NULL; + } + } +#endif + + /* + * Failing that, pick any on the wait list. + */ + if (!next && !list_empty(&lock->wait_list)) { /* get the first entry from the wait-list: */ struct mutex_waiter *waiter =3D list_first_entry(&lock->wait_list, @@ -951,6 +981,9 @@ static noinline void __sched __mutex_unlock_slowpath(st= ruct mutex *lock, unsigne __mutex_handoff(lock, next); =20 preempt_disable(); +#ifdef CONFIG_PROXY_EXEC + raw_spin_unlock(¤t->blocked_lock); +#endif raw_spin_unlock_irqrestore(&lock->wait_lock, flags); =20 wake_up_q(&wake_q); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 3dce69feb934..328776421c7a 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -95,6 +95,7 @@ #include "../workqueue_internal.h" #include "../../io_uring/io-wq.h" #include "../smpboot.h" +#include "../locking/mutex.h" =20 EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpu); EXPORT_TRACEPOINT_SYMBOL_GPL(ipi_send_cpumask); @@ -2799,8 +2800,15 @@ static int affine_move_task(struct rq *rq, struct ta= sk_struct *p, struct rq_flag struct set_affinity_pending my_pending =3D { }, *pending =3D NULL; bool stop_pending, complete =3D false; =20 - /* Can the task run on the task's current CPU? If so, we're done */ - if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask)) { + /* + * Can the task run on the task's current CPU? If so, we're done + * + * We are also done if the task is selected, boosting a lock- + * holding proxy, (and potentially has been migrated outside its + * current or previous affinity mask) + */ + if (cpumask_test_cpu(task_cpu(p), &p->cpus_mask) || + (task_current_selected(rq, p) && !task_current(rq, p))) { struct task_struct *push_task =3D NULL; =20 if ((flags & SCA_MIGRATE_ENABLE) && @@ -3713,6 +3721,54 @@ static inline void ttwu_do_wakeup(struct task_struct= *p) trace_sched_wakeup(p); } =20 +#ifdef CONFIG_PROXY_EXEC +static void activate_task_and_blocked_ent(struct rq *rq, struct task_struc= t *p, int en_flags) +{ + /* + * By calling activate_task with blocked_lock held, we order against + * the proxy() blocked_task case such that no more blocked tasks will + * be enqueued on p once we release p->blocked_lock. + */ + raw_spin_lock(&p->blocked_lock); + activate_task(rq, p, en_flags); + raw_spin_unlock(&p->blocked_lock); + + /* + * A whole bunch of 'proxy' tasks back this blocked task, wake + * them all up to give this task its 'fair' share. 
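+ *
+ * (Illustrative note, not part of the original comment: if tasks A and B
+ * were parked on p->blocked_entry while p was blocked, waking p here also
+ * re-activates A and B on this rq, unless the on_rq check below shows
+ * they already raced with another activation.)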
+ */ + while (!list_empty(&p->blocked_entry)) { + struct task_struct *pp =3D + list_first_entry(&p->blocked_entry, + struct task_struct, + blocked_entry); + raw_spin_lock(&pp->blocked_lock); + BUG_ON(pp->blocked_entry.prev !=3D &p->blocked_entry); + + list_del_init(&pp->blocked_entry); + if (READ_ONCE(pp->on_rq)) { + /* + * We raced with a non mutex handoff activation of pp. + * That activation will also take care of activating + * all of the tasks after pp in the blocked_entry list, + * so we're done here. + */ + raw_spin_unlock(&pp->blocked_lock); + break; + } + __set_task_cpu(pp, cpu_of(rq)); + activate_task(rq, pp, en_flags); + resched_curr(rq); + raw_spin_unlock(&pp->blocked_lock); + } +} +#else +static inline void activate_task_and_blocked_ent(struct rq *rq, struct tas= k_struct *p, int en_flags) +{ + activate_task(rq, p, en_flags); +} +#endif + static void ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags, struct rq_flags *rf) @@ -3734,7 +3790,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p= , int wake_flags, atomic_dec(&task_rq(p)->nr_iowait); } =20 - activate_task(rq, p, en_flags); + activate_task_and_blocked_ent(rq, p, en_flags); + check_preempt_curr(rq, p, wake_flags); =20 ttwu_do_wakeup(p); @@ -3767,6 +3824,75 @@ ttwu_do_activate(struct rq *rq, struct task_struct *= p, int wake_flags, #endif } =20 +#ifdef CONFIG_PROXY_EXEC +bool ttwu_proxy_skip_wakeup(struct rq *rq, struct task_struct *p) +{ + if (task_current(rq, p)) { + bool ret =3D true; + + raw_spin_lock(&p->blocked_lock); + if (task_is_blocked(p) && __mutex_owner(p->blocked_on) =3D=3D p) + p->blocked_on =3D NULL; + if (!task_is_blocked(p)) + ret =3D false; + raw_spin_unlock(&p->blocked_lock); + return ret; + } + + /* + * Since we don't dequeue for blocked-on relations, we'll always + * trigger the on_rq_queued() clause for them. + */ + if (task_is_blocked(p)) { + raw_spin_lock(&p->blocked_lock); + + if (!p->blocked_on || __mutex_owner(p->blocked_on) !=3D p) { + /* + * p already woke, ran and blocked on another mutex. + * Since a successful wakeup already happened, we're + * done. + */ + raw_spin_unlock(&p->blocked_lock); + return true; + } + + p->blocked_on =3D NULL; + if (!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr)) { + /* + * proxy stuff moved us outside of the affinity mask + * 'sleep' now and fail the direct wakeup so that the + * normal wakeup path will fix things. + */ + deactivate_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK); + if (task_current_selected(rq, p)) { + /* + * If p is the proxy, then remove lingering + * references to it from rq and sched_class structs after + * dequeueing. + */ + put_prev_task(rq, p); + rq_set_selected(rq, rq->idle); + } + resched_curr(rq); + raw_spin_unlock(&p->blocked_lock); + return true; + } + /* + * Must resched after killing a blocked_on relation. The currently + * executing context might not be the most elegible anymore. 
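+ * (For example, p may so far have been used only as a scheduling context
+ * to run the lock owner; with the blocked_on relation gone, p itself may
+ * now be the better pick.)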
+ */ + resched_curr(rq); + raw_spin_unlock(&p->blocked_lock); + } + return false; +} +#else +static inline bool ttwu_proxy_skip_wakeup(struct rq *rq, struct task_struc= t *p) +{ + return false; +} +#endif + /* * Consider @p being inside a wait loop: * @@ -3799,9 +3925,15 @@ static int ttwu_runnable(struct task_struct *p, int = wake_flags) int ret =3D 0; =20 rq =3D __task_rq_lock(p, &rf); - if (!task_on_rq_queued(p)) + if (!task_on_rq_queued(p)) { + BUG_ON(task_is_running(p)); goto out_unlock; + } =20 + /* + * ttwu_do_wakeup()-> + * check_preempt_curr() may use rq clock + */ if (!task_on_cpu(rq, p)) { /* * When on_rq && !on_cpu the task is preempted, see if @@ -3810,8 +3942,13 @@ static int ttwu_runnable(struct task_struct *p, int = wake_flags) update_rq_clock(rq); check_preempt_curr(rq, p, wake_flags); } + + if (ttwu_proxy_skip_wakeup(rq, p)) + goto out_unlock; + ttwu_do_wakeup(p); ret =3D 1; + out_unlock: __task_rq_unlock(rq, &rf); =20 @@ -4225,6 +4362,11 @@ try_to_wake_up(struct task_struct *p, unsigned int s= tate, int wake_flags) if (READ_ONCE(p->on_rq) && ttwu_runnable(p, wake_flags)) goto unlock; =20 + if (task_is_blocked(p)) { + success =3D 0; + goto unlock; + } + #ifdef CONFIG_SMP /* * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be @@ -5620,6 +5762,15 @@ void scheduler_tick(void) =20 rq_lock(rq, &rf); =20 +#ifdef CONFIG_PROXY_EXEC + if (task_cpu(curr) !=3D cpu) { + BUG_ON(!test_preempt_need_resched() && + !tif_need_resched()); + rq_unlock(rq, &rf); + return; + } +#endif + update_rq_clock(rq); thermal_pressure =3D arch_scale_thermal_pressure(cpu_of(rq)); update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure); @@ -6520,6 +6671,332 @@ pick_next_task(struct rq *rq, struct task_struct *p= rev, struct rq_flags *rf) # define SM_MASK_PREEMPT SM_PREEMPT #endif =20 +#ifdef CONFIG_PROXY_EXEC + +static struct task_struct * +proxy_migrate_task(struct rq *rq, struct task_struct *next, + struct rq_flags *rf, struct task_struct *p, + int that_cpu, bool curr_in_chain) +{ + struct rq *that_rq; + LIST_HEAD(migrate_list); + + /* + * If the blocked-on relationship crosses CPUs, migrate @p to the + * @owner's CPU. + * + * This is because we must respect the CPU affinity of execution + * contexts (@owner) but we can ignore affinity for scheduling + * contexts (@p). So we have to move scheduling contexts towards + * potential execution contexts. + */ + that_rq =3D cpu_rq(that_cpu); + + /* + * @owner can disappear, simply migrate to @that_cpu and leave that CPU + * to sort things out. + */ + + /* + * Since we're going to drop @rq, we have to put(@next) first, + * otherwise we have a reference that no longer belongs to us. Use + * @fake_task to fill the void and make the next pick_next_task() + * invocation happy. + * + * CPU0 CPU1 + * + * B mutex_lock(X) + * + * A mutex_lock(X) <- B + * A __schedule() + * A pick->A + * A proxy->B + * A migrate A to CPU1 + * B mutex_unlock(X) -> A + * B __schedule() + * B pick->A + * B switch_to (A) + * A ... does stuff + * A ... is still running here + * + * * BOOM * + */ + put_prev_task(rq, next); + if (curr_in_chain) { + rq_set_selected(rq, rq->idle); + set_tsk_need_resched(rq->idle); + return rq->idle; + } + rq_set_selected(rq, rq->idle); + + for (; p; p =3D p->blocked_donor) { + int wake_cpu =3D p->wake_cpu; + + WARN_ON(p =3D=3D rq->curr); + + deactivate_task(rq, p, 0); + set_task_cpu(p, that_cpu); + /* + * We can abuse blocked_entry to migrate the thing, + * because @p is still on the rq. 
+ */ + list_add(&p->blocked_entry, &migrate_list); + + /* + * Preserve p->wake_cpu, such that we can tell where it + * used to run later. + */ + p->wake_cpu =3D wake_cpu; + } + + if (rq->balance_callback) + __balance_callbacks(rq); + + rq_unpin_lock(rq, rf); + raw_spin_rq_unlock(rq); + raw_spin_rq_lock(that_rq); + + while (!list_empty(&migrate_list)) { + p =3D list_first_entry(&migrate_list, struct task_struct, blocked_entry); + list_del_init(&p->blocked_entry); + + enqueue_task(that_rq, p, 0); + check_preempt_curr(that_rq, p, 0); + p->on_rq =3D TASK_ON_RQ_QUEUED; + } + + raw_spin_rq_unlock(that_rq); + raw_spin_rq_lock(rq); + rq_repin_lock(rq, rf); + + return NULL; /* Retry task selection on _this_ CPU. */ +} + +static inline struct task_struct * +proxy_resched_idle(struct rq *rq, struct task_struct *next) +{ + put_prev_task(rq, next); + rq_set_selected(rq, rq->idle); + set_tsk_need_resched(rq->idle); + return rq->idle; +} + +static void proxy_enqueue_on_owner(struct rq *rq, struct task_struct *p, + struct task_struct *owner, + struct task_struct *next) +{ + /* + * Walk back up the blocked_donor relation and enqueue them all on @owner + * + * ttwu_activate() will pick them up and place them on whatever rq + * @owner will run next. + */ + if (!owner->on_rq) { + for (; p; p =3D p->blocked_donor) { + if (p =3D=3D owner) + continue; + BUG_ON(!p->on_rq); + deactivate_task(rq, p, DEQUEUE_SLEEP); + if (task_current_selected(rq, p)) { + put_prev_task(rq, next); + rq_set_selected(rq, rq->idle); + } + /* + * ttwu_do_activate must not have a chance to activate p + * elsewhere before it's fully extricated from its old rq. + */ + smp_mb(); + list_add(&p->blocked_entry, &owner->blocked_entry); + } + } +} + +/* + * Find who @next (currently blocked on a mutex) can proxy for. + * + * Follow the blocked-on relation: + * + * ,-> task + * | | blocked-on + * | v + * blocked_donor | mutex + * | | owner + * | v + * `-- task + * + * and set the blocked_donor relation, this latter is used by the mutex + * code to find which (blocked) task to hand-off to. + * + * Lock order: + * + * p->pi_lock + * rq->lock + * mutex->wait_lock + * p->blocked_lock + * + * Returns the task that is going to be used as execution context (the one + * that is actually going to be put to run on cpu_of(rq)). + */ +static struct task_struct * +proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) +{ + struct task_struct *p =3D next; + struct task_struct *owner =3D NULL; + bool curr_in_chain =3D false; + int this_cpu, that_cpu; + struct mutex *mutex; + + this_cpu =3D cpu_of(rq); + + /* + * Follow blocked_on chain. + * + * TODO: deadlock detection + */ + for (p =3D next; p->blocked_on; p =3D owner) { + mutex =3D p->blocked_on; + /* Something changed in the chain, pick_again */ + if (!mutex) + return NULL; + + /* + * By taking mutex->wait_lock we hold off concurrent mutex_unlock() + * and ensure @owner sticks around. + */ + raw_spin_lock(&mutex->wait_lock); + raw_spin_lock(&p->blocked_lock); + + /* Check again that p is blocked with blocked_lock held */ + if (!task_is_blocked(p) || mutex !=3D p->blocked_on) { + /* + * Something changed in the blocked_on chain and + * we don't know if only at this level. So, let's + * just bail out completely and let __schedule + * figure things out (pick_again loop). 
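+ * (For example, the mutex may have been handed to p in the meantime, or
+ * p may now be blocked on a different mutex.)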
+ */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return NULL; + } + + if (task_current(rq, p)) + curr_in_chain =3D true; + + owner =3D __mutex_owner(mutex); + if (task_cpu(owner) !=3D this_cpu) { + that_cpu =3D task_cpu(owner); + /* + * @owner can disappear, simply migrate to @that_cpu and leave that CPU + * to sort things out. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + + return proxy_migrate_task(rq, next, rf, p, that_cpu, curr_in_chain); + } + + if (task_on_rq_migrating(owner)) { + /* + * One of the chain of mutex owners is currently migrating to this + * CPU, but has not yet been enqueued because we are holding the + * rq lock. As a simple solution, just schedule rq->idle to give + * the migration a chance to complete. Much like the migrate_task + * case we should end up back in proxy(), this time hopefully with + * all relevant tasks already enqueued. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq, next); + } + + if (!owner->on_rq) { + /* + * rq->curr must not be added to the blocked_entry list or else + * ttwu_do_activate could enqueue it elsewhere before it switches + * out here. The approach to avoiding this is the same as in the + * migrate_task case. + */ + if (curr_in_chain) { + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq, next); + } + + /* + * If !@owner->on_rq, holding @rq->lock will not pin the task, + * so we cannot drop @mutex->wait_lock until we're sure its a blocked + * task on this rq. + * + * We use @owner->blocked_lock to serialize against ttwu_activate(). + * Either we see its new owner->on_rq or it will see our list_add(). + */ + if (owner !=3D p) { + raw_spin_unlock(&p->blocked_lock); + raw_spin_lock(&owner->blocked_lock); + } + + proxy_enqueue_on_owner(rq, p, owner, next); + + if (task_current_selected(rq, next)) { + put_prev_task(rq, next); + rq_set_selected(rq, rq->idle); + } + raw_spin_unlock(&owner->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + + return NULL; /* retry task selection */ + } + + if (owner =3D=3D p) { + /* + * Its possible we interleave with mutex_unlock like: + * + * lock(&rq->lock); + * proxy() + * mutex_unlock() + * lock(&wait_lock); + * next(owner) =3D current->blocked_donor; + * unlock(&wait_lock); + * + * wake_up_q(); + * ... + * ttwu_runnable() + * __task_rq_lock() + * lock(&wait_lock); + * owner =3D=3D p + * + * Which leaves us to finish the ttwu_runnable() and make it go. + * + * So schedule rq->idle so that ttwu_runnable can get the rq lock + * and mark owner as running. + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + return proxy_resched_idle(rq, next); + } + + /* + * OK, now we're absolutely sure @owner is not blocked _and_ + * on this rq, therefore holding @rq->lock is sufficient to + * guarantee its existence, as per ttwu_remote(). + */ + raw_spin_unlock(&p->blocked_lock); + raw_spin_unlock(&mutex->wait_lock); + + owner->blocked_donor =3D p; + } + + WARN_ON_ONCE(owner && !owner->on_rq); + return owner; +} +#else /* PROXY_EXEC */ +static struct task_struct * +proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) +{ + return next; +} +#endif /* PROXY_EXEC */ + /* * __schedule() is the main scheduler function. 
* @@ -6567,6 +7044,7 @@ static void __sched notrace __schedule(unsigned int s= ched_mode) struct rq_flags rf; struct rq *rq; int cpu; + bool preserve_need_resched =3D false; =20 cpu =3D smp_processor_id(); rq =3D cpu_rq(cpu); @@ -6612,7 +7090,7 @@ static void __sched notrace __schedule(unsigned int s= ched_mode) if (!(sched_mode & SM_MASK_PREEMPT) && prev_state) { if (signal_pending_state(prev_state, prev)) { WRITE_ONCE(prev->__state, TASK_RUNNING); - } else { + } else if (!task_is_blocked(prev)) { prev->sched_contributes_to_load =3D (prev_state & TASK_UNINTERRUPTIBLE) && !(prev_state & TASK_NOLOAD) && @@ -6638,13 +7116,43 @@ static void __sched notrace __schedule(unsigned int= sched_mode) atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } + } else { + /* + * Let's make this task, which is blocked on + * a mutex, (push/pull)able (RT/DL). + * Unfortunately we can only deal with that by + * means of a dequeue/enqueue cycle. :-/ + */ + dequeue_task(rq, prev, 0); + enqueue_task(rq, prev, 0); } switch_count =3D &prev->nvcsw; } =20 - next =3D pick_next_task(rq, prev, &rf); +pick_again: + /* + * If picked task is actually blocked it means that it can act as a + * proxy for the task that is holding the mutex picked task is blocked + * on. Get a reference to the blocked (going to be proxy) task here. + * Note that if next isn't actually blocked we will have rq->proxy =3D=3D + * rq->curr =3D=3D next in the end, which is intended and means that proxy + * execution is currently "not in use". + */ + next =3D pick_next_task(rq, rq_selected(rq), &rf); rq_set_selected(rq, next); - clear_tsk_need_resched(prev); + next->blocked_donor =3D NULL; + if (unlikely(task_is_blocked(next))) { + next =3D proxy(rq, next, &rf); + if (!next) { + __balance_callbacks(rq); + goto pick_again; + } + if (next =3D=3D rq->idle && prev =3D=3D rq->idle) + preserve_need_resched =3D true; + } + + if (!preserve_need_resched) + clear_tsk_need_resched(prev); clear_preempt_need_resched(); #ifdef CONFIG_SCHED_DEBUG rq->last_seen_need_resched_ns =3D 0; @@ -6731,6 +7239,9 @@ static inline void sched_submit_work(struct task_stru= ct *tsk) */ SCHED_WARN_ON(current->__state & TASK_RTLOCK_WAIT); =20 + if (task_is_blocked(tsk)) + return; + /* * If we are going to sleep and we have plugged IO queued, * make sure to submit it to avoid deadlocks. 
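Condensing the __schedule() changes above into a short sketch of the resulting picking flow (paraphrased from the hunks above, not literal patch code):

	pick_again:
		next =3D pick_next_task(rq, rq_selected(rq), &rf);
		rq_set_selected(rq, next);
		next->blocked_donor =3D NULL;
		if (unlikely(task_is_blocked(next))) {
			next =3D proxy(rq, next, &rf);	/* may migrate tasks or punt to idle */
			if (!next) {
				__balance_callbacks(rq);
				goto pick_again;	/* the chain changed, pick again */
			}
		}
		/* context_switch() then runs @next as the execution context */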
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index d41d562df078..1d2711aee448 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1740,7 +1740,7 @@ static void enqueue_task_dl(struct rq *rq, struct tas= k_struct *p, int flags) =20 enqueue_dl_entity(&p->dl, flags); =20 - if (!task_current(rq, p) && p->nr_cpus_allowed > 1) + if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && !task_is_blocked(p)) enqueue_pushable_dl_task(rq, p); } =20 diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 62c3c1762004..43efc576d2c6 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -8023,7 +8023,9 @@ pick_next_task_fair(struct rq *rq, struct task_struct= *prev, struct rq_flags *rf goto idle; =20 #ifdef CONFIG_FAIR_GROUP_SCHED - if (!prev || prev->sched_class !=3D &fair_sched_class) + if (!prev || + prev->sched_class !=3D &fair_sched_class || + rq->curr !=3D rq_selected(rq)) goto simple; =20 /* @@ -8541,6 +8543,9 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) =20 lockdep_assert_rq_held(env->src_rq); =20 + if (task_is_blocked(p)) + return 0; + /* * We do not migrate tasks that are: * 1) throttled_lb_pair, or @@ -8591,7 +8596,7 @@ int can_migrate_task(struct task_struct *p, struct lb= _env *env) /* Record that we found at least one task that could run on dst_cpu */ env->flags &=3D ~LBF_ALL_PINNED; =20 - if (task_on_cpu(env->src_rq, p)) { + if (task_on_cpu(env->src_rq, p) || task_current_selected(env->src_rq, p))= { schedstat_inc(p->stats.nr_failed_migrations_running); return 0; } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index 3ba24c3fce20..f5b1075e8170 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1537,7 +1537,8 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p,= int flags) =20 enqueue_rt_entity(rt_se, flags); =20 - if (!task_current(rq, p) && p->nr_cpus_allowed > 1) + if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && + !task_is_blocked(p)) enqueue_pushable_task(rq, p); } =20 diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 29597a6fd65b..1c832516b7e8 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2106,6 +2106,19 @@ static inline int task_current_selected(struct rq *r= q, struct task_struct *p) return rq_selected(rq) =3D=3D p; } =20 +#ifdef CONFIG_PROXY_EXEC +static inline bool task_is_blocked(struct task_struct *p) +{ + return !!p->blocked_on; +} +#else /* !PROXY_EXEC */ +static inline bool task_is_blocked(struct task_struct *p) +{ + return false; +} + +#endif /* PROXY_EXEC */ + static inline int task_on_cpu(struct rq *rq, struct task_struct *p) { #ifdef CONFIG_SMP @@ -2267,12 +2280,17 @@ struct sched_class { =20 static inline void put_prev_task(struct rq *rq, struct task_struct *prev) { - WARN_ON_ONCE(rq_selected(rq) !=3D prev); + WARN_ON_ONCE(rq->curr !=3D prev && prev !=3D rq_selected(rq)); + + if (prev =3D=3D rq_selected(rq) && task_cpu(prev) !=3D cpu_of(rq)) + return; + prev->sched_class->put_prev_task(rq, prev); } =20 static inline void set_next_task(struct rq *rq, struct task_struct *next) { + WARN_ON_ONCE(!task_current_selected(rq, next)); next->sched_class->set_next_task(rq, next, false); } =20 --=20 2.41.0.rc0.172.g3f132b7071-goog From nobody Sat Feb 7 18:15:59 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 58F72C77B7A for ; Thu, 1 Jun 2023 06:00:28 +0000 (UTC) Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231795AbjFAGA0 (ORCPT ); Thu, 1 Jun 2023 02:00:26 -0400 Date: Thu, 1 Jun 2023 05:58:13 +0000 In-Reply-To: <20230601055846.2349566-1-jstultz@google.com> Mime-Version: 1.0 References: <20230601055846.2349566-1-jstultz@google.com> X-Mailer: git-send-email 2.41.0.rc0.172.g3f132b7071-goog Message-ID: <20230601055846.2349566-11-jstultz@google.com> Subject: [PATCH v4 10/13] sched/rt: Fix proxy/current (push,pull)ability From: John Stultz To: LKML Cc: Valentin Schneider , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Youssef Esmat , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E .
McKenney" , kernel-team@android.com, "Connor O'Brien" , John Stultz Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Valentin Schneider Proxy execution forms atomic pairs of tasks: a proxy (scheduling context) and an owner (execution context). The proxy, along with the rest of the blocked chain, follows the owner wrt CPU placement. They can be the same task, in which case push/pull doesn't need any modification. When they are different, however, consider FIFO1 & FIFO42:

              ,-> RT42
              |     | blocked-on
              |     v
        blocked_donor |   mutex
              |     | owner
              |     v
              `-- RT1

           RT1
           RT42

          CPU0            CPU1
           ^                ^
           |                |
          overloaded    !overloaded
          rq prio =3D 42   rq prio =3D 0

RT1 is eligible to be pushed to CPU1, but should that happen it will "carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as push/pullable. Furthermore, tasks becoming blocked on a mutex don't need an explicit dequeue/enqueue cycle to be made (push/pull)able: they have to be running to block on a mutex, thus they will eventually hit put_prev_task(). XXX: pinned tasks becoming unblocked should be removed from the push/pull lists, but those don't get to see __schedule() straight away. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Youssef Esmat Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: Valentin Schneider Signed-off-by: Connor O'Brien Signed-off-by: John Stultz --- v3: * Tweaked comments & commit message TODO: Rework the wording of the commit message to match the rq_selected renaming. (XXX Maybe "Delegator" for the task being proxied for?) --- kernel/sched/core.c | 36 +++++++++++++++++++++++++++--------- kernel/sched/rt.c | 22 +++++++++++++++++----- 2 files changed, 44 insertions(+), 14 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 328776421c7a..c56921dc427e 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6989,12 +6989,29 @@ proxy(struct rq *rq, struct task_struct *next, stru= ct rq_flags *rf) WARN_ON_ONCE(owner && !owner->on_rq); return owner; } + +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next) +{ + /* + * pick_next_task() calls set_next_task() on the selected task + * at some point, which ensures it is not push/pullable. + * However, the selected task *and* the mutex owner form an + * atomic pair wrt push/pull. + * + * Make sure owner is not pushable. Unfortunately we can only + * deal with that by means of a dequeue/enqueue cycle.
:-/ + */ + dequeue_task(rq, next, DEQUEUE_NOCLOCK | DEQUEUE_SAVE); + enqueue_task(rq, next, ENQUEUE_NOCLOCK | ENQUEUE_RESTORE); +} #else /* PROXY_EXEC */ static struct task_struct * proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf) { return next; } + +static inline void proxy_tag_curr(struct rq *rq, struct task_struct *next)= { } #endif /* PROXY_EXEC */ =20 /* @@ -7043,6 +7060,7 @@ static void __sched notrace __schedule(unsigned int s= ched_mode) unsigned long prev_state; struct rq_flags rf; struct rq *rq; + bool proxied; int cpu; bool preserve_need_resched =3D false; =20 @@ -7116,19 +7134,11 @@ static void __sched notrace __schedule(unsigned int= sched_mode) atomic_inc(&rq->nr_iowait); delayacct_blkio_start(); } - } else { - /* - * Let's make this task, which is blocked on - * a mutex, (push/pull)able (RT/DL). - * Unfortunately we can only deal with that by - * means of a dequeue/enqueue cycle. :-/ - */ - dequeue_task(rq, prev, 0); - enqueue_task(rq, prev, 0); } switch_count =3D &prev->nvcsw; } =20 + proxied =3D !!prev->blocked_donor; pick_again: /* * If picked task is actually blocked it means that it can act as a @@ -7165,6 +7175,10 @@ static void __sched notrace __schedule(unsigned int = sched_mode) * changes to task_struct made by pick_next_task(). */ RCU_INIT_POINTER(rq->curr, next); + + if (unlikely(!task_current_selected(rq, next))) + proxy_tag_curr(rq, next); + /* * The membarrier system call requires each architecture * to have a full memory barrier after updating @@ -7189,6 +7203,10 @@ static void __sched notrace __schedule(unsigned int = sched_mode) /* Also unlocks the rq: */ rq =3D context_switch(rq, prev, next, &rf); } else { + /* In case next was already curr but just got blocked_donor*/ + if (unlikely(!proxied && next->blocked_donor)) + proxy_tag_curr(rq, next); + rq->clock_update_flags &=3D ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP); =20 rq_unpin_lock(rq, &rf); diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index f5b1075e8170..d6bffcf31de0 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1537,9 +1537,21 @@ enqueue_task_rt(struct rq *rq, struct task_struct *p= , int flags) =20 enqueue_rt_entity(rt_se, flags); =20 - if (!task_current(rq, p) && p->nr_cpus_allowed > 1 && - !task_is_blocked(p)) - enqueue_pushable_task(rq, p); + /* + * Current can't be pushed away. Proxy is tied to current, so don't + * push it either. + */ + if (task_current(rq, p) || task_current_selected(rq, p)) + return; + + /* + * Pinned tasks can't be pushed. + * Affinity of blocked tasks doesn't matter. + */ + if (!task_is_blocked(p) && p->nr_cpus_allowed =3D=3D 1) + return; + + enqueue_pushable_task(rq, p); } =20 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int flag= s) @@ -1828,9 +1840,9 @@ static void put_prev_task_rt(struct rq *rq, struct ta= sk_struct *p) =20 /* * The previous task needs to be made eligible for pushing - * if it is still active + * if it is still active. Affinity of blocked task doesn't matter. 
*/ - if (on_rt_rq(&p->rt) && p->nr_cpus_allowed > 1) + if (on_rt_rq(&p->rt) && (p->nr_cpus_allowed > 1 || task_is_blocked(p))) enqueue_pushable_task(rq, p); } =20 --=20 2.41.0.rc0.172.g3f132b7071-goog From nobody Sat Feb 7 18:15:59 2026 Date: Thu, 1 Jun 2023 05:58:14 +0000 In-Reply-To: <20230601055846.2349566-1-jstultz@google.com> Mime-Version: 1.0 References: <20230601055846.2349566-1-jstultz@google.com> X-Mailer: git-send-email 2.41.0.rc0.172.g3f132b7071-goog Message-ID: <20230601055846.2349566-12-jstultz@google.com> Subject: [PATCH v4 11/13] sched: Fix runtime accounting w/ proxy-execution From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter
Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Youssef Esmat , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" The idea here is we want to charge the selected task's vruntime but charge the executed task's sum_exec_runtime. This way cputime accounting goes against the task actually running but vruntime accounting goes against the selected task so we get proper fairness. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Youssef Esmat Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: John Stultz --- kernel/sched/fair.c | 24 +++++++++++++++++++----- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 43efc576d2c6..c2e17bfa6b31 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -891,22 +891,36 @@ static void update_tg_load_avg(struct cfs_rq *cfs_rq) } #endif /* CONFIG_SMP */ =20 -static s64 update_curr_se(struct rq *rq, struct sched_entity *curr) +static s64 update_curr_se(struct rq *rq, struct sched_entity *se) { u64 now =3D rq_clock_task(rq); s64 delta_exec; =20 - delta_exec =3D now - curr->exec_start; + /* Calculate the delta from selected se */ + delta_exec =3D now - se->exec_start; if (unlikely(delta_exec <=3D 0)) return delta_exec; =20 - curr->exec_start =3D now; - curr->sum_exec_runtime +=3D delta_exec; + /* Update selected se's exec_start */ + se->exec_start =3D now; + if (entity_is_task(se)) { + struct task_struct *running =3D rq->curr; + /* + * If se is a task, we account the time + * against the running task, as w/ proxy-exec + * they may not be the same. 
+ */ + running->se.exec_start =3D now; + running->se.sum_exec_runtime +=3D delta_exec; + } else { + /* If not task, account the time against se */ + se->sum_exec_runtime +=3D delta_exec; + } =20 if (schedstat_enabled()) { struct sched_statistics *stats; =20 - stats =3D __schedstats_from_se(curr); + stats =3D __schedstats_from_se(se); __schedstat_set(stats->exec_max, max(delta_exec, stats->exec_max)); } --=20 2.41.0.rc0.172.g3f132b7071-goog From nobody Sat Feb 7 18:15:59 2026 Date: Thu, 1 Jun 2023 05:58:15 +0000 In-Reply-To: <20230601055846.2349566-1-jstultz@google.com> Mime-Version: 1.0 References: <20230601055846.2349566-1-jstultz@google.com> X-Mailer: git-send-email
2.41.0.rc0.172.g3f132b7071-goog Message-ID: <20230601055846.2349566-13-jstultz@google.com> Subject: [PATCH v4 12/13] sched: Attempt to fix rt/dl load balancing via chain level balance From: John Stultz To: LKML Cc: "Connor O'Brien" , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Youssef Esmat , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E . McKenney" , kernel-team@android.com, John Stultz Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Connor O'Brien RT/DL balancing is supposed to guarantee that with N cpus available & CPU affinity permitting, the top N RT/DL tasks will get spread across the CPUs and all get to run. Proxy exec greatly complicates this as blocked tasks remain on the rq but cannot be usefully migrated away from their lock owning tasks. This has two major consequences: 1. In order to get the desired properties we need to migrate a blocked task, its would-be proxy, and everything in between, all together - i.e., we need to push/pull "blocked chains" rather than individual tasks. 2. Tasks that are part of rq->curr's "blocked tree" therefore should not be pushed or pulled. Options for enforcing this seem to include a) create some sort of complex data structure for tracking pushability, updating it whenever the blocked tree for rq->curr changes (e.g. on mutex handoffs, migrations, etc.) as well as on context switches. b) give up on O(1) pushability checks, and search through the pushable list every push/pull until we find a pushable "chain" c) Extend option "b" with some sort of caching to avoid repeated work. For the sake of simplicity & separating the "chain level balancing" concerns from complicated optimizations, this patch focuses on trying to implement option "b" correctly. This can then hopefully provide a baseline for "correct load balancing behavior" that optimizations can try to implement more efficiently. Note: The inability to atomically check "is task enqueued on a specific rq" creates 2 possible races when following a blocked chain: - If we check task_rq() first on a task that is dequeued from its rq, it can be woken and enqueued on another rq before the call to task_on_rq_queued() - If we call task_on_rq_queued() first on a task that is on another rq, it can be dequeued (since we don't hold its rq's lock) and then be set to the current rq before we check task_rq(). Maybe there's a more elegant solution that would work, but for now, just sandwich the task_rq() check between two task_on_rq_queued() checks, all separated by smp_rmb() calls. Since we hold rq's lock, task can't be enqueued or dequeued from rq, so neither race should be possible. extensive comments on various pitfalls, races, etc. included inline. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Youssef Esmat Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . 
McKenney" Cc: kernel-team@android.com Signed-off-by: Connor O'Brien [jstultz: rebased & sorted minor conflicts, folded down numerous fixes from Connor, fixed number of checkpatch issues] Signed-off-by: John Stultz --- v3: * Fix crash by checking find_exec_ctx return for NULL before using it v4: * Remove verbose comments/questions to avoid review distractions, as suggested by Dietmar * Moved most added functions from sched.h into core.c to be able to access __mutex_owner() --- kernel/sched/core.c | 108 +++++++++++++++++++++- kernel/sched/cpudeadline.c | 12 +-- kernel/sched/cpudeadline.h | 3 +- kernel/sched/cpupri.c | 28 +++--- kernel/sched/cpupri.h | 6 +- kernel/sched/deadline.c | 139 +++++++++++++++++----------- kernel/sched/rt.c | 179 +++++++++++++++++++++++++------------ kernel/sched/sched.h | 8 +- 8 files changed, 352 insertions(+), 131 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index c56921dc427e..e0e6c2feefd0 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2540,9 +2540,7 @@ int push_cpu_stop(void *arg) =20 // XXX validate p is still the highest prio task if (task_rq(p) =3D=3D rq) { - deactivate_task(rq, p, 0); - set_task_cpu(p, lowest_rq->cpu); - activate_task(lowest_rq, p, 0); + push_task_chain(rq, lowest_rq, p); resched_curr(lowest_rq); } =20 @@ -3824,6 +3822,110 @@ ttwu_do_activate(struct rq *rq, struct task_struct = *p, int wake_flags, #endif } =20 +static inline bool task_queued_on_rq(struct rq *rq, struct task_struct *ta= sk) +{ + if (!task_on_rq_queued(task)) + return false; + smp_rmb(); + if (task_rq(task) !=3D rq) + return false; + smp_rmb(); + if (!task_on_rq_queued(task)) + return false; + return true; +} + +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct = *task) +{ + struct task_struct *owner; + + lockdep_assert_rq_held(rq); + lockdep_assert_rq_held(dst_rq); + + BUG_ON(!task_queued_on_rq(rq, task)); + BUG_ON(task_current_selected(rq, task)); + + while (task) { + if (!task_queued_on_rq(rq, task) || task_current_selected(rq, task)) + break; + + if (task_is_blocked(task)) + owner =3D __mutex_owner(task->blocked_on); + else + owner =3D NULL; + + deactivate_task(rq, task, 0); + set_task_cpu(task, dst_rq->cpu); + activate_task(dst_rq, task, 0); + if (task =3D=3D owner) + break; + task =3D owner; + } +} + +/* + * Returns the unblocked task at the end of the blocked chain starting wit= h p + * if that chain is composed entirely of tasks enqueued on rq, or NULL oth= erwise. + */ +struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p) +{ + struct task_struct *exec_ctx, *owner; + struct mutex *mutex; + + lockdep_assert_rq_held(rq); + + for (exec_ctx =3D p; task_is_blocked(exec_ctx) && !task_on_cpu(rq, exec_c= tx); + exec_ctx =3D owner) { + mutex =3D exec_ctx->blocked_on; + owner =3D __mutex_owner(mutex); + if (owner =3D=3D exec_ctx) + break; + + if (!task_queued_on_rq(rq, owner) || task_current_selected(rq, owner)) { + exec_ctx =3D NULL; + break; + } + } + return exec_ctx; +} + +/* + * Returns: + * 1 if chain is pushable and affinity does not prevent pushing to cpu + * 0 if chain is unpushable + * -1 if chain is pushable but affinity blocks running on cpu. + */ +int pushable_chain(struct rq *rq, struct task_struct *p, int cpu) +{ + struct task_struct *exec_ctx; + + lockdep_assert_rq_held(rq); + + if (task_rq(p) !=3D rq || !task_on_rq_queued(p)) + return 0; + + exec_ctx =3D find_exec_ctx(rq, p); + /* + * Chain leads off the rq, we're free to push it anywhere. 
+ * + * One wrinkle with relying on find_exec_ctx is that when the chain + * leads to a task currently migrating to rq, we see the chain as + * pushable & push everything prior to the migrating task. Even if + * we checked explicitly for this case, we could still race with a + * migration after the check. + * This shouldn't permanently produce a bad state though, as proxy() + * will send the chain back to rq and by that point the migration + * should be complete & a proper push can occur. + */ + if (!exec_ctx) + return 1; + + if (task_on_cpu(rq, exec_ctx) || exec_ctx->nr_cpus_allowed <=3D 1) + return 0; + + return cpumask_test_cpu(cpu, &exec_ctx->cpus_mask) ? 1 : -1; +} + #ifdef CONFIG_PROXY_EXEC bool ttwu_proxy_skip_wakeup(struct rq *rq, struct task_struct *p) { diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c index 57c92d751bcd..efd6d716a3f2 100644 --- a/kernel/sched/cpudeadline.c +++ b/kernel/sched/cpudeadline.c @@ -113,13 +113,13 @@ static inline int cpudl_maximum(struct cpudl *cp) * * Returns: int - CPUs were found */ -int cpudl_find(struct cpudl *cp, struct task_struct *p, +int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx, struct tas= k_struct *exec_ctx, struct cpumask *later_mask) { - const struct sched_dl_entity *dl_se =3D &p->dl; + const struct sched_dl_entity *dl_se =3D &sched_ctx->dl; =20 if (later_mask && - cpumask_and(later_mask, cp->free_cpus, &p->cpus_mask)) { + cpumask_and(later_mask, cp->free_cpus, &exec_ctx->cpus_mask)) { unsigned long cap, max_cap =3D 0; int cpu, max_cpu =3D -1; =20 @@ -128,13 +128,13 @@ int cpudl_find(struct cpudl *cp, struct task_struct *= p, =20 /* Ensure the capacity of the CPUs fits the task. */ for_each_cpu(cpu, later_mask) { - if (!dl_task_fits_capacity(p, cpu)) { + if (!dl_task_fits_capacity(sched_ctx, cpu)) { cpumask_clear_cpu(cpu, later_mask); =20 cap =3D capacity_orig_of(cpu); =20 if (cap > max_cap || - (cpu =3D=3D task_cpu(p) && cap =3D=3D max_cap)) { + (cpu =3D=3D task_cpu(exec_ctx) && cap =3D=3D max_cap)) { max_cap =3D cap; max_cpu =3D cpu; } @@ -150,7 +150,7 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p, =20 WARN_ON(best_cpu !=3D -1 && !cpu_present(best_cpu)); =20 - if (cpumask_test_cpu(best_cpu, &p->cpus_mask) && + if (cpumask_test_cpu(best_cpu, &exec_ctx->cpus_mask) && dl_time_before(dl_se->deadline, cp->elements[0].dl)) { if (later_mask) cpumask_set_cpu(best_cpu, later_mask); diff --git a/kernel/sched/cpudeadline.h b/kernel/sched/cpudeadline.h index 0adeda93b5fb..6bb27f70e9d2 100644 --- a/kernel/sched/cpudeadline.h +++ b/kernel/sched/cpudeadline.h @@ -16,7 +16,8 @@ struct cpudl { }; =20 #ifdef CONFIG_SMP -int cpudl_find(struct cpudl *cp, struct task_struct *p, struct cpumask *l= ater_mask); +int cpudl_find(struct cpudl *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *later_mask); void cpudl_set(struct cpudl *cp, int cpu, u64 dl); void cpudl_clear(struct cpudl *cp, int cpu); int cpudl_init(struct cpudl *cp); diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c index a286e726eb4b..fb4ddfde221e 100644 --- a/kernel/sched/cpupri.c +++ b/kernel/sched/cpupri.c @@ -96,11 +96,15 @@ static inline int __cpupri_find(struct cpupri *cp, stru= ct task_struct *p, if (skip) return 0; =20 - if (cpumask_any_and(&p->cpus_mask, vec->mask) >=3D nr_cpu_ids) + if ((p && cpumask_any_and(&p->cpus_mask, vec->mask) >=3D nr_cpu_ids) || + (!p && cpumask_any(vec->mask) >=3D nr_cpu_ids)) return 0; =20 if (lowest_mask) { - cpumask_and(lowest_mask, &p->cpus_mask, vec->mask); + if (p) 
+ cpumask_and(lowest_mask, &p->cpus_mask, vec->mask); + else + cpumask_copy(lowest_mask, vec->mask); =20 /* * We have to ensure that we have at least one bit @@ -117,10 +121,11 @@ static inline int __cpupri_find(struct cpupri *cp, st= ruct task_struct *p, return 1; } =20 -int cpupri_find(struct cpupri *cp, struct task_struct *p, +int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *lowest_mask) { - return cpupri_find_fitness(cp, p, lowest_mask, NULL); + return cpupri_find_fitness(cp, sched_ctx, exec_ctx, lowest_mask, NULL); } =20 /** @@ -140,18 +145,19 @@ int cpupri_find(struct cpupri *cp, struct task_struct= *p, * * Return: (int)bool - CPUs were found */ -int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, - struct cpumask *lowest_mask, - bool (*fitness_fn)(struct task_struct *p, int cpu)) +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, + struct cpumask *lowest_mask, + bool (*fitness_fn)(struct task_struct *p, int cpu)) { - int task_pri =3D convert_prio(p->prio); + int task_pri =3D convert_prio(sched_ctx->prio); int idx, cpu; =20 WARN_ON_ONCE(task_pri >=3D CPUPRI_NR_PRIORITIES); =20 for (idx =3D 0; idx < task_pri; idx++) { =20 - if (!__cpupri_find(cp, p, lowest_mask, idx)) + if (!__cpupri_find(cp, exec_ctx, lowest_mask, idx)) continue; =20 if (!lowest_mask || !fitness_fn) @@ -159,7 +165,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_= struct *p, =20 /* Ensure the capacity of the CPUs fit the task */ for_each_cpu(cpu, lowest_mask) { - if (!fitness_fn(p, cpu)) + if (!fitness_fn(sched_ctx, cpu)) cpumask_clear_cpu(cpu, lowest_mask); } =20 @@ -191,7 +197,7 @@ int cpupri_find_fitness(struct cpupri *cp, struct task_= struct *p, * really care. 
*/ if (fitness_fn) - return cpupri_find(cp, p, lowest_mask); + return cpupri_find(cp, sched_ctx, exec_ctx, lowest_mask); =20 return 0; } diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h index d6cba0020064..bde7243cec2e 100644 --- a/kernel/sched/cpupri.h +++ b/kernel/sched/cpupri.h @@ -18,9 +18,11 @@ struct cpupri { }; =20 #ifdef CONFIG_SMP -int cpupri_find(struct cpupri *cp, struct task_struct *p, +int cpupri_find(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *lowest_mask); -int cpupri_find_fitness(struct cpupri *cp, struct task_struct *p, +int cpupri_find_fitness(struct cpupri *cp, struct task_struct *sched_ctx, + struct task_struct *exec_ctx, struct cpumask *lowest_mask, bool (*fitness_fn)(struct task_struct *p, int cpu)); void cpupri_set(struct cpupri *cp, int cpu, int pri); diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c index 1d2711aee448..3cc8f96480e8 100644 --- a/kernel/sched/deadline.c +++ b/kernel/sched/deadline.c @@ -1814,7 +1814,7 @@ static inline bool dl_task_is_earliest_deadline(struc= t task_struct *p, rq->dl.earliest_dl.curr)); } =20 -static int find_later_rq(struct task_struct *task); +static int find_later_rq(struct task_struct *sched_ctx, struct task_struct= *exec_ctx); =20 static int select_task_rq_dl(struct task_struct *p, int cpu, int flags) @@ -1854,7 +1854,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int= flags) select_rq |=3D !dl_task_fits_capacity(p, cpu); =20 if (select_rq) { - int target =3D find_later_rq(p); + int target =3D find_later_rq(p, p); =20 if (target !=3D -1 && dl_task_is_earliest_deadline(p, cpu_rq(target))) @@ -1901,12 +1901,18 @@ static void migrate_task_rq_dl(struct task_struct *= p, int new_cpu __maybe_unused =20 static void check_preempt_equal_dl(struct rq *rq, struct task_struct *p) { + struct task_struct *exec_ctx; + /* * Current can't be migrated, useless to reschedule, * let's hope p can move out. */ if (rq->curr->nr_cpus_allowed =3D=3D 1 || - !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL)) + !cpudl_find(&rq->rd->cpudl, rq_selected(rq), rq->curr, NULL)) + return; + + exec_ctx =3D find_exec_ctx(rq, p); + if (task_current(rq, exec_ctx)) return; =20 /* @@ -1914,7 +1920,7 @@ static void check_preempt_equal_dl(struct rq *rq, str= uct task_struct *p) * see if it is pushed or pulled somewhere else. 
*/ if (p->nr_cpus_allowed !=3D 1 && - cpudl_find(&rq->rd->cpudl, p, NULL)) + cpudl_find(&rq->rd->cpudl, p, exec_ctx, NULL)) return; =20 resched_curr(rq); @@ -2084,14 +2090,6 @@ static void task_fork_dl(struct task_struct *p) /* Only try algorithms three times */ #define DL_MAX_TRIES 3 =20 -static int pick_dl_task(struct rq *rq, struct task_struct *p, int cpu) -{ - if (!task_on_cpu(rq, p) && - cpumask_test_cpu(cpu, &p->cpus_mask)) - return 1; - return 0; -} - /* * Return the earliest pushable rq's task, which is suitable to be executed * on the CPU, NULL otherwise: @@ -2110,7 +2108,7 @@ static struct task_struct *pick_earliest_pushable_dl_= task(struct rq *rq, int cpu if (next_node) { p =3D __node_2_pdl(next_node); =20 - if (pick_dl_task(rq, p, cpu)) + if (pushable_chain(rq, p, cpu) =3D=3D 1) return p; =20 next_node =3D rb_next(next_node); @@ -2122,25 +2120,25 @@ static struct task_struct *pick_earliest_pushable_d= l_task(struct rq *rq, int cpu =20 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl); =20 -static int find_later_rq(struct task_struct *task) +static int find_later_rq(struct task_struct *sched_ctx, struct task_struct= *exec_ctx) { struct sched_domain *sd; struct cpumask *later_mask =3D this_cpu_cpumask_var_ptr(local_cpu_mask_dl= ); int this_cpu =3D smp_processor_id(); - int cpu =3D task_cpu(task); + int cpu =3D task_cpu(sched_ctx); =20 /* Make sure the mask is initialized first */ if (unlikely(!later_mask)) return -1; =20 - if (task->nr_cpus_allowed =3D=3D 1) + if (exec_ctx && exec_ctx->nr_cpus_allowed =3D=3D 1) return -1; =20 /* * We have to consider system topology and task affinity * first, then we can look for a suitable CPU. */ - if (!cpudl_find(&task_rq(task)->rd->cpudl, task, later_mask)) + if (!cpudl_find(&task_rq(exec_ctx)->rd->cpudl, sched_ctx, exec_ctx, later= _mask)) return -1; =20 /* @@ -2209,15 +2207,62 @@ static int find_later_rq(struct task_struct *task) return -1; } =20 +static struct task_struct *pick_next_pushable_dl_task(struct rq *rq) +{ + struct task_struct *p =3D NULL; + struct rb_node *next_node; + + if (!has_pushable_dl_tasks(rq)) + return NULL; + + next_node =3D rb_first_cached(&rq->dl.pushable_dl_tasks_root); + +next_node: + if (next_node) { + p =3D __node_2_pdl(next_node); + + /* + * cpu argument doesn't matter because we treat a -1 result + * (pushable but can't go to cpu0) the same as a 1 result + * (pushable to cpu0). All we care about here is general + * pushability. + */ + if (pushable_chain(rq, p, 0)) + return p; + + next_node =3D rb_next(next_node); + goto next_node; + } + + if (!p) + return NULL; + + WARN_ON_ONCE(rq->cpu !=3D task_cpu(p)); + WARN_ON_ONCE(task_current(rq, p)); + WARN_ON_ONCE(p->nr_cpus_allowed <=3D 1); + + WARN_ON_ONCE(!task_on_rq_queued(p)); + WARN_ON_ONCE(!dl_task(p)); + + return p; +} + /* Locks the rq it finds */ static struct rq *find_lock_later_rq(struct task_struct *task, struct rq *= rq) { + struct task_struct *exec_ctx; struct rq *later_rq =3D NULL; + bool retry; int tries; int cpu; =20 for (tries =3D 0; tries < DL_MAX_TRIES; tries++) { - cpu =3D find_later_rq(task); + retry =3D false; + exec_ctx =3D find_exec_ctx(rq, task); + if (!exec_ctx) + break; + + cpu =3D find_later_rq(task, exec_ctx); =20 if ((cpu =3D=3D -1) || (cpu =3D=3D rq->cpu)) break; @@ -2236,12 +2281,29 @@ static struct rq *find_lock_later_rq(struct task_st= ruct *task, struct rq *rq) =20 /* Retry if something changed. 
*/ if (double_lock_balance(rq, later_rq)) { - if (unlikely(task_rq(task) !=3D rq || - !cpumask_test_cpu(later_rq->cpu, &task->cpus_mask) || - task_on_cpu(rq, task) || - !dl_task(task) || - is_migration_disabled(task) || - !task_on_rq_queued(task))) { + bool fail =3D false; + + if (!dl_task(task) || is_migration_disabled(task)) { + fail =3D true; + } else if (rq !=3D this_rq()) { + struct task_struct *next_task =3D pick_next_pushable_dl_task(rq); + + if (next_task !=3D task) { + fail =3D true; + } else { + exec_ctx =3D find_exec_ctx(rq, next_task); + retry =3D (exec_ctx && + !cpumask_test_cpu(later_rq->cpu, + &exec_ctx->cpus_mask)); + } + } else { + int pushable =3D pushable_chain(rq, task, later_rq->cpu); + + fail =3D !pushable; + retry =3D pushable =3D=3D -1; + } + + if (unlikely(fail)) { double_unlock_balance(rq, later_rq); later_rq =3D NULL; break; @@ -2253,7 +2315,7 @@ static struct rq *find_lock_later_rq(struct task_stru= ct *task, struct rq *rq) * its earliest one has a later deadline than our * task, the rq is a good one. */ - if (dl_task_is_earliest_deadline(task, later_rq)) + if (!retry && dl_task_is_earliest_deadline(task, later_rq)) break; =20 /* Otherwise we try again. */ @@ -2264,25 +2326,6 @@ static struct rq *find_lock_later_rq(struct task_str= uct *task, struct rq *rq) return later_rq; } =20 -static struct task_struct *pick_next_pushable_dl_task(struct rq *rq) -{ - struct task_struct *p; - - if (!has_pushable_dl_tasks(rq)) - return NULL; - - p =3D __node_2_pdl(rb_first_cached(&rq->dl.pushable_dl_tasks_root)); - - WARN_ON_ONCE(rq->cpu !=3D task_cpu(p)); - WARN_ON_ONCE(task_current(rq, p)); - WARN_ON_ONCE(p->nr_cpus_allowed <=3D 1); - - WARN_ON_ONCE(!task_on_rq_queued(p)); - WARN_ON_ONCE(!dl_task(p)); - - return p; -} - /* * See if the non running -deadline tasks on this rq * can be sent to some other CPU where they can preempt @@ -2351,9 +2394,7 @@ static int push_dl_task(struct rq *rq) goto retry; } =20 - deactivate_task(rq, next_task, 0); - set_task_cpu(next_task, later_rq->cpu); - activate_task(later_rq, next_task, 0); + push_task_chain(rq, later_rq, next_task); ret =3D 1; =20 resched_curr(later_rq); @@ -2439,9 +2480,7 @@ static void pull_dl_task(struct rq *this_rq) if (is_migration_disabled(p)) { push_task =3D get_push_task(src_rq); } else { - deactivate_task(src_rq, p, 0); - set_task_cpu(p, this_cpu); - activate_task(this_rq, p, 0); + push_task_chain(src_rq, this_rq, p); dmin =3D p->dl.deadline; resched =3D true; } diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c index d6bffcf31de0..a1780c2c7101 100644 --- a/kernel/sched/rt.c +++ b/kernel/sched/rt.c @@ -1599,7 +1599,7 @@ static void yield_task_rt(struct rq *rq) } =20 #ifdef CONFIG_SMP -static int find_lowest_rq(struct task_struct *task); +static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struc= t *exec_ctx); =20 static int select_task_rq_rt(struct task_struct *p, int cpu, int flags) @@ -1649,7 +1649,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int= flags) (curr->nr_cpus_allowed < 2 || selected->prio <=3D p->prio); =20 if (test || !rt_task_fits_capacity(p, cpu)) { - int target =3D find_lowest_rq(p); + int target =3D find_lowest_rq(p, p); =20 /* * Bail out if we were forcing a migration to find a better @@ -1676,8 +1676,18 @@ select_task_rq_rt(struct task_struct *p, int cpu, in= t flags) =20 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p) { + struct task_struct *exec_ctx =3D p; + /* + * Current can't be migrated, useless to reschedule, + * let's hope p can move 
out. + */ if (rq->curr->nr_cpus_allowed =3D=3D 1 || - !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL)) + !cpupri_find(&rq->rd->cpupri, rq_selected(rq), rq->curr, NULL)) + return; + + /* No reason to preempt since rq->curr wouldn't change anyway */ + exec_ctx =3D find_exec_ctx(rq, p); + if (task_current(rq, exec_ctx)) return; =20 /* @@ -1685,7 +1695,7 @@ static void check_preempt_equal_prio(struct rq *rq, s= truct task_struct *p) * see if it is pushed or pulled somewhere else. */ if (p->nr_cpus_allowed !=3D 1 && - cpupri_find(&rq->rd->cpupri, p, NULL)) + cpupri_find(&rq->rd->cpupri, p, exec_ctx, NULL)) return; =20 /* @@ -1851,15 +1861,6 @@ static void put_prev_task_rt(struct rq *rq, struct t= ask_struct *p) /* Only try algorithms three times */ #define RT_MAX_TRIES 3 =20 -static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu) -{ - if (!task_on_cpu(rq, p) && - cpumask_test_cpu(cpu, &p->cpus_mask)) - return 1; - - return 0; -} - /* * Return the highest pushable rq's task, which is suitable to be executed * on the CPU, NULL otherwise @@ -1873,7 +1874,7 @@ static struct task_struct *pick_highest_pushable_task= (struct rq *rq, int cpu) return NULL; =20 plist_for_each_entry(p, head, pushable_tasks) { - if (pick_rt_task(rq, p, cpu)) + if (pushable_chain(rq, p, cpu) =3D=3D 1) return p; } =20 @@ -1882,19 +1883,19 @@ static struct task_struct *pick_highest_pushable_ta= sk(struct rq *rq, int cpu) =20 static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask); =20 -static int find_lowest_rq(struct task_struct *task) +static int find_lowest_rq(struct task_struct *sched_ctx, struct task_struc= t *exec_ctx) { struct sched_domain *sd; struct cpumask *lowest_mask =3D this_cpu_cpumask_var_ptr(local_cpu_mask); int this_cpu =3D smp_processor_id(); - int cpu =3D task_cpu(task); + int cpu =3D task_cpu(sched_ctx); int ret; =20 /* Make sure the mask is initialized first */ if (unlikely(!lowest_mask)) return -1; =20 - if (task->nr_cpus_allowed =3D=3D 1) + if (exec_ctx && exec_ctx->nr_cpus_allowed =3D=3D 1) return -1; /* No other targets possible */ =20 /* @@ -1903,13 +1904,13 @@ static int find_lowest_rq(struct task_struct *task) */ if (sched_asym_cpucap_active()) { =20 - ret =3D cpupri_find_fitness(&task_rq(task)->rd->cpupri, - task, lowest_mask, + ret =3D cpupri_find_fitness(&task_rq(sched_ctx)->rd->cpupri, + sched_ctx, exec_ctx, lowest_mask, rt_task_fits_capacity); } else { =20 - ret =3D cpupri_find(&task_rq(task)->rd->cpupri, - task, lowest_mask); + ret =3D cpupri_find(&task_rq(sched_ctx)->rd->cpupri, + sched_ctx, exec_ctx, lowest_mask); } =20 if (!ret) @@ -1973,15 +1974,45 @@ static int find_lowest_rq(struct task_struct *task) return -1; } =20 +static struct task_struct *pick_next_pushable_task(struct rq *rq) +{ + struct plist_head *head =3D &rq->rt.pushable_tasks; + struct task_struct *p, *push_task =3D NULL; + + if (!has_pushable_tasks(rq)) + return NULL; + + plist_for_each_entry(p, head, pushable_tasks) { + if (pushable_chain(rq, p, 0)) { + push_task =3D p; + break; + } + } + + if (!push_task) + return NULL; + + BUG_ON(rq->cpu !=3D task_cpu(push_task)); + BUG_ON(task_current(rq, push_task) || task_current_selected(rq, push_task= )); + BUG_ON(!task_on_rq_queued(push_task)); + BUG_ON(!rt_task(push_task)); + + return p; +} + /* Will lock the rq it finds */ static struct rq *find_lock_lowest_rq(struct task_struct *task, struct rq = *rq) { + struct task_struct *exec_ctx; struct rq *lowest_rq =3D NULL; + bool retry; int tries; int cpu; =20 for (tries =3D 0; tries < RT_MAX_TRIES; tries++) { - cpu 
=3D find_lowest_rq(task); + retry =3D false; + exec_ctx =3D find_exec_ctx(rq, task); + cpu =3D find_lowest_rq(task, exec_ctx); =20 if ((cpu =3D=3D -1) || (cpu =3D=3D rq->cpu)) break; @@ -2000,6 +2031,7 @@ static struct rq *find_lock_lowest_rq(struct task_str= uct *task, struct rq *rq) =20 /* if the prio of this runqueue changed, try again */ if (double_lock_balance(rq, lowest_rq)) { + bool fail =3D false; /* * We had to unlock the run queue. In * the mean time, task could have @@ -2008,14 +2040,71 @@ static struct rq *find_lock_lowest_rq(struct task_s= truct *task, struct rq *rq) * It is possible the task was scheduled, set * "migrate_disabled" and then got preempted, so we must * check the task migration disable flag here too. + * + * Releasing the rq lock means we need to re-check pushability. + * Some scenarios: + * 1) If a migration from another CPU sent a task/chain to rq + * that made task newly unpushable by completing a chain + * from task to rq->curr, then we need to bail out and push something + * else. + * 2) If our chain led off this CPU or to a dequeued task, the last wai= ter + * on this CPU might have acquired the lock and woken (or even migra= ted + * & run, handed off the lock it held, etc...). This can invalidate = the + * result of find_lowest_rq() if our chain previously ended in a blo= cked + * task whose affinity we could ignore, but now ends in an unblocked + * task that can't run on lowest_rq. + * 3) Race described at https://lore.kernel.org/all/1523536384-26781-2-= git-send-email-huawei.libin@huawei.com/ + * + * Notes on these: + * - Scenario #2 is properly handled by rerunning find_lowest_rq + * - Scenario #1 requires that we fail + * - Scenario #3 can AFAICT only occur when rq is not this_rq(). And the + * suggested fix is not universally correct now that push_cpu_stop() = can + * call this function. */ - if (unlikely(task_rq(task) !=3D rq || - !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) || - task_on_cpu(rq, task) || - !rt_task(task) || - is_migration_disabled(task) || - !task_on_rq_queued(task))) { + if (!rt_task(task) || is_migration_disabled(task)) { + fail =3D true; + } else if (rq !=3D this_rq()) { + /* + * If we are dealing with a remote rq, then all bets are off + * because task might have run & then been dequeued since we + * released the lock, at which point our normal checks can race + * with migration, as described in + * https://lore.kernel.org/all/1523536384-26781-2-git-send-email-huawe= i.libin@huawei.com/ + * Need to repick to ensure we avoid a race. + * But re-picking would be unnecessary & incorrect in the + * push_cpu_stop() path. + */ + struct task_struct *next_task =3D pick_next_pushable_task(rq); + + if (next_task !=3D task) { + fail =3D true; + } else { + exec_ctx =3D find_exec_ctx(rq, next_task); + retry =3D (exec_ctx && + !cpumask_test_cpu(lowest_rq->cpu, + &exec_ctx->cpus_mask)); + } + } else { + /* + * Chain level balancing introduces new ways for our choice of + * task & rq to become invalid when we release the rq lock, e.g.: + * 1) Migration to rq from another CPU makes task newly unpushable + * by completing a "blocked chain" from task to rq->curr. + * Fail so a different task can be chosen for push. + * 2) In cases where task's blocked chain led to a dequeued task + * or one on another rq, the last waiter in the chain on this + * rq might have acquired the lock and woken, meaning we must + * pick a different rq if its affinity prevents running on + * lowest_rq. 
+ */ + int pushable =3D pushable_chain(rq, task, lowest_rq->cpu); =20 + fail =3D !pushable; + retry =3D pushable =3D=3D -1; + } + + if (unlikely(fail)) { double_unlock_balance(rq, lowest_rq); lowest_rq =3D NULL; break; @@ -2023,7 +2112,7 @@ static struct rq *find_lock_lowest_rq(struct task_str= uct *task, struct rq *rq) } =20 /* If this rq is still suitable use it. */ - if (lowest_rq->rt.highest_prio.curr > task->prio) + if (lowest_rq->rt.highest_prio.curr > task->prio && !retry) break; =20 /* try again */ @@ -2034,26 +2123,6 @@ static struct rq *find_lock_lowest_rq(struct task_st= ruct *task, struct rq *rq) return lowest_rq; } =20 -static struct task_struct *pick_next_pushable_task(struct rq *rq) -{ - struct task_struct *p; - - if (!has_pushable_tasks(rq)) - return NULL; - - p =3D plist_first_entry(&rq->rt.pushable_tasks, - struct task_struct, pushable_tasks); - - BUG_ON(rq->cpu !=3D task_cpu(p)); - BUG_ON(task_current(rq, p) || task_current_selected(rq, p)); - BUG_ON(p->nr_cpus_allowed <=3D 1); - - BUG_ON(!task_on_rq_queued(p)); - BUG_ON(!rt_task(p)); - - return p; -} - /* * If the current CPU has more than one RT task, see if the non * running task can migrate over to a CPU that is running a task @@ -2099,10 +2168,10 @@ static int push_rt_task(struct rq *rq, bool pull) * Note that the stoppers are masqueraded as SCHED_FIFO * (cf. sched_set_stop_task()), so we can't rely on rt_task(). */ - if (rq->curr->sched_class !=3D &rt_sched_class) + if (rq_selected(rq)->sched_class !=3D &rt_sched_class) return 0; =20 - cpu =3D find_lowest_rq(rq->curr); + cpu =3D find_lowest_rq(rq_selected(rq), rq->curr); if (cpu =3D=3D -1 || cpu =3D=3D rq->cpu) return 0; =20 @@ -2164,9 +2233,7 @@ static int push_rt_task(struct rq *rq, bool pull) goto retry; } =20 - deactivate_task(rq, next_task, 0); - set_task_cpu(next_task, lowest_rq->cpu); - activate_task(lowest_rq, next_task, 0); + push_task_chain(rq, lowest_rq, next_task); resched_curr(lowest_rq); ret =3D 1; =20 @@ -2437,9 +2504,7 @@ static void pull_rt_task(struct rq *this_rq) if (is_migration_disabled(p)) { push_task =3D get_push_task(src_rq); } else { - deactivate_task(src_rq, p, 0); - set_task_cpu(p, this_cpu); - activate_task(this_rq, p, 0); + push_task_chain(src_rq, this_rq, p); resched =3D true; } /* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1c832516b7e8..3a80cc4278ca 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2366,7 +2366,7 @@ extern void set_cpus_allowed_common(struct task_struc= t *p, struct affinity_conte =20 static inline struct task_struct *get_push_task(struct rq *rq) { - struct task_struct *p =3D rq->curr; + struct task_struct *p =3D rq_selected(rq); =20 lockdep_assert_rq_held(rq); =20 @@ -3530,4 +3530,10 @@ static inline void task_tick_mm_cid(struct rq *rq, s= truct task_struct *curr) { } static inline void init_sched_mm_cid(struct task_struct *t) { } #endif =20 +#ifdef CONFIG_SMP +void push_task_chain(struct rq *rq, struct rq *dst_rq, struct task_struct = *task); +struct task_struct *find_exec_ctx(struct rq *rq, struct task_struct *p); +int pushable_chain(struct rq *rq, struct task_struct *p, int cpu); +#endif + #endif /* _KERNEL_SCHED_SCHED_H */ --=20 2.41.0.rc0.172.g3f132b7071-goog From nobody Sat Feb 7 18:15:59 2026 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 89E27C77B7A for ; Thu, 1 Jun 2023 06:00:58 +0000 (UTC) 
Date: Thu, 1 Jun 2023 05:58:16 +0000 In-Reply-To: <20230601055846.2349566-1-jstultz@google.com> Mime-Version: 1.0 References: <20230601055846.2349566-1-jstultz@google.com> X-Mailer: git-send-email 2.41.0.rc0.172.g3f132b7071-goog Message-ID: <20230601055846.2349566-14-jstultz@google.com> Subject: [PATCH v4 13/13] sched: Fixups to find_exec_ctx From: John Stultz To: LKML Cc: John Stultz , Joel Fernandes , Qais Yousef , Ingo Molnar , Peter Zijlstra , Juri Lelli , Vincent Guittot , Dietmar Eggemann , Valentin Schneider , Steven Rostedt , Ben Segall , Zimuzo Ezeozue , Youssef Esmat , Mel Gorman , Daniel Bristot de Oliveira , Will Deacon , Waiman Long , Boqun Feng , "Paul E .
McKenney" , kernel-team@android.com Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" find_exec_ctx() would sometimes cause the RT task pushing code to try to push tasks in the chain that ends in rq->curr. This caused lots of migration noise and effectively a livelock where tasks would get pushed off to other CPUs, then proxy-migrated back to the lock owner's CPU, over and over. This kept other CPUs constantly proxy-migrating away and never actually selecting a task to run - effectively hanging the system. So this patch reworks some of the find_exec_ctx() logic so that we stop when we hit rq->curr, and changes the logic that was returning NULL when we came across rq_selected(), as I'm not sure why we'd stop there. Cc: Joel Fernandes Cc: Qais Yousef Cc: Ingo Molnar Cc: Peter Zijlstra Cc: Juri Lelli Cc: Vincent Guittot Cc: Dietmar Eggemann Cc: Valentin Schneider Cc: Steven Rostedt Cc: Ben Segall Cc: Zimuzo Ezeozue Cc: Youssef Esmat Cc: Mel Gorman Cc: Daniel Bristot de Oliveira Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: "Paul E . McKenney" Cc: kernel-team@android.com Signed-off-by: John Stultz --- kernel/sched/core.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index e0e6c2feefd0..9cdabb79d450 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3881,7 +3881,15 @@ struct task_struct *find_exec_ctx(struct rq *rq, str= uct task_struct *p) if (owner =3D=3D exec_ctx) break; =20 - if (!task_queued_on_rq(rq, owner) || task_current_selected(rq, owner)) { + /* If we get to current, that's the exec ctx! */ + if (task_current(rq, owner)) + return owner; + + /* + * XXX This previously was checking task_current_selected() + * but that doesn't make much sense to me. -jstultz + */ + if (!task_queued_on_rq(rq, owner)) { exec_ctx =3D NULL; break; } --=20 2.41.0.rc0.172.g3f132b7071-goog
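[Editor's illustration, not part of the series] To make the chain-walk idea behind find_exec_ctx() and pushable_chain() in the last two patches easier to follow, here is a minimal, self-contained user-space sketch: follow blocked_on to the mutex owner until an unblocked task is reached, and give up if the chain leads off the runqueue. This is not kernel code; struct task, struct mutex, the single implicit runqueue and the RT1/RT10/RT42 names are illustrative assumptions, and all of the locking, rq->curr/rq_selected() and migration details the real patches handle are deliberately left out.

/*
 * Toy model of the "blocked chain" walk. struct task and struct mutex
 * are simplified stand-ins for task_struct and the kernel mutex; there
 * is one imaginary runqueue and no locking.
 */
#include <stdio.h>
#include <stdbool.h>

struct mutex;

struct task {
        const char *name;
        bool on_rq;                  /* enqueued on the (single, toy) runqueue */
        struct mutex *blocked_on;    /* mutex this task waits on, or NULL */
};

struct mutex {
        struct task *owner;          /* current owner, or NULL if released */
};

/*
 * Walk p -> mutex -> owner -> mutex -> owner ... and return the first
 * task that is not blocked (the "execution context"), or NULL if the
 * chain leads to a task that is not on this runqueue.
 */
static struct task *chain_exec_ctx(struct task *p)
{
        struct task *t = p;

        while (t->blocked_on) {
                struct task *owner = t->blocked_on->owner;

                if (!owner || owner == t)      /* released or self-owned: stop here */
                        break;
                if (!owner->on_rq)             /* chain leads off this runqueue */
                        return NULL;
                t = owner;
        }
        return t;
}

int main(void)
{
        struct mutex m1, m2;
        struct task rt1  = { "RT1",  true, NULL };   /* lock owner, runnable */
        struct task rt10 = { "RT10", true, &m1 };    /* blocked on m1 */
        struct task rt42 = { "RT42", true, &m2 };    /* blocked on m2 */
        struct task *ec;

        m1.owner = &rt1;
        m2.owner = &rt10;

        /* RT42 -> m2 -> RT10 -> m1 -> RT1: RT1 is the execution context. */
        ec = chain_exec_ctx(&rt42);
        printf("exec ctx for %s: %s\n", rt42.name, ec ? ec->name : "(chain leads off this rq)");

        /* Dequeue RT1: the chain now leads off the runqueue. */
        rt1.on_rq = false;
        ec = chain_exec_ctx(&rt42);
        printf("exec ctx for %s: %s\n", rt42.name, ec ? ec->name : "(chain leads off this rq)");
        return 0;
}

In this toy model a NULL return corresponds to the "chain leads off the rq" case, which pushable_chain() in patch 12 treats as freely pushable, while a non-NULL return names the execution context whose affinity and CPU placement constrain where the whole chain may be pushed.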