From nobody Sun Feb  8 00:27:08 2026
Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com
 [209.85.219.201])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 30632748C
	for <linux-kernel@vger.kernel.org>; Wed, 20 Dec 2023 00:19:17 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org;
 dmarc=pass (p=reject dis=none) header.from=google.com
Authentication-Results: smtp.subspace.kernel.org;
 spf=pass smtp.mailfrom=flex--jstultz.bounces.google.com
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=google.com header.i=@google.com
 header.b="bwz6nykz"
Received: by mail-yb1-f201.google.com with SMTP id
 3f1490d57ef6-dbd4a080c0bso2649248276.1
        for <linux-kernel@vger.kernel.org>;
 Tue, 19 Dec 2023 16:19:17 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20230601; t=1703031557; x=1703636357;
 darn=vger.kernel.org;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:from:to:cc:subject:date:message-id:reply-to;
        bh=oaiagOlY2yIt6EK/2XFKSdkelCGW18s3codp/jhpWBM=;
        b=bwz6nykzmkIWGdCr1EJDLaVn1u4PbOs3bpvbU45haNGX8frTRMhcPUsi68bpRF3eg3
         erMSgp+YL2qGBXRR1EFCIPYmwzgrOUGAakqHtEa23zea1OP+BWwWLrVURLSGEHgBpILq
         zRutLEVscxD8jpAweHiRg7/P9n2wQHo3xtBq/bivIKWdqPD/zbxyzpwBpFC4mF88My4w
         RCMD9B35Q1ZUtPO/epa6sXP+TfLGF3KaXYwQxo4Leo6OMGtRwCIXTVnLKBzaN7FSVOEt
         O7H0xIYLUKkoiA2Qlxzp293xhF+kwNP2UfTxfP+1ADdhNSZE+q+dC9OMgKkXdeLgXKEJ
         Loqw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1703031557; x=1703636357;
        h=cc:to:from:subject:message-id:references:mime-version:in-reply-to
         :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=oaiagOlY2yIt6EK/2XFKSdkelCGW18s3codp/jhpWBM=;
        b=OSP+eEPJFgBCZ3IjSa193aiarA21FL95IcWn1JyhZvdq/IKKOzqzy4oQ120OsF1Q7v
         RdajXtPq4XX9nKnNcPzbooUR+lT1I1LMa55474empb8OYzw7Vm5sFzmwrbydj6KnYMpY
         XHfOh3aG4DNPQbQJ7kl1AGhk17c7XiHEXPrewPPTOLIcEmLffpRnMiGEbRlc7Tg13dAo
         40UEgnOWg83gl+CDavU5HlqtLYYYhP9GE0WFT/kYMNyryv9gVVcbSmcieFI4CCLgfE+z
         TcHsZxVH8fC5R4BEC1WX1oPxPFXVaZ+/uceaP+xOo5H89r3rgmgGnWP6nQgVcOQr7JXc
         ZKxA==
X-Gm-Message-State: AOJu0YzJgEVA1v3dVxwvXiMWSXY5y9BbTehLeNTsxe2JK1Jdk4Efj+Uo
	crlB/xqrqpC/q6W0DwFFVujRR3gICVN006RMWLHBdCPFAE15usdOtUDnnqtVhhmmkFEsQIEHkSB
	UanmdfgbQOjakz/KB4JyqaGoj5tMQTk7yK1wgVk/M5jwEzfedqiEJhaLjRf0/zhYOnbh9lnc=
X-Google-Smtp-Source: 
 AGHT+IEk5kzkoyXxgClT4wlkTjRmZvesGSdT9maK6Vao9SNI3mL1m0p2Iu8eWuJOPmuBRh3ybmn7JX1oh0OW
X-Received: from jstultz-noogler2.c.googlers.com
 ([fda3:e722:ac3:cc00:24:72f4:c0a8:600])
 (user=jstultz job=sendgmr) by 2002:a05:6902:1341:b0:dbd:ab89:659d with SMTP
 id g1-20020a056902134100b00dbdab89659dmr22428ybu.3.1703031556773; Tue, 19 Dec
 2023 16:19:16 -0800 (PST)
Date: Tue, 19 Dec 2023 16:18:19 -0800
In-Reply-To: <20231220001856.3710363-1-jstultz@google.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20231220001856.3710363-1-jstultz@google.com>
X-Mailer: git-send-email 2.43.0.472.g3155946c3a-goog
Message-ID: <20231220001856.3710363-9-jstultz@google.com>
Subject: [PATCH v7 08/23] sched: Split scheduler and execution contexts
From: John Stultz <jstultz@google.com>
To: LKML <linux-kernel@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>, Joel Fernandes <joelaf@google.com>,
	Qais Yousef <qyousef@google.com>, Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
 Vincent Guittot <vincent.guittot@linaro.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
 Valentin Schneider <vschneid@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>, Ben Segall <bsegall@google.com>,
	Zimuzo Ezeozue <zezeozue@google.com>,
 Youssef Esmat <youssefesmat@google.com>,
	Mel Gorman <mgorman@suse.de>,
 Daniel Bristot de Oliveira <bristot@redhat.com>,
 Will Deacon <will@kernel.org>,
	Waiman Long <longman@redhat.com>, Boqun Feng <boqun.feng@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>, Xuewen Yan <xuewen.yan94@gmail.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>, Metin Kaya <Metin.Kaya@arm.com>,
	Thomas Gleixner <tglx@linutronix.de>, kernel-team@android.com,
	"Connor O'Brien" <connoro@google.com>, John Stultz <jstultz@google.com>
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

From: Peter Zijlstra <peterz@infradead.org>

Let's define the scheduling context as all the scheduler state
in task_struct for the task selected to run, and the execution
context as all state required to actually run the task.

Currently both are intertwined in task_struct. We want to
logically split these such that we can use the scheduling
context of the task selected to be scheduled, but use the
execution context of a different task to actually be run.

To this end, introduce the rq_selected() macro, which points to
the task_struct selected from the runqueue by the scheduler and
is used for scheduler state, and preserve rq->curr to indicate
the execution context of the task that will actually be run.
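
As a rough illustration only (a userspace toy, not kernel code
and not part of this patch), the split can be sketched as below.
The struct task and struct rq stand-ins and the toy_*() helpers
are hypothetical names made up for the example; only rq->curr,
curr_selected and rq_selected() mirror names from this series:

```c
#include <assert.h>

/* Toy stand-in for task_struct: scheduler state only. */
struct task {
	int prio;		/* lower value means higher priority */
	unsigned long ticks;	/* accounted runtime, in ticks */
};

/* Toy stand-in for struct rq with both context pointers. */
struct rq {
	struct task *curr;		/* execution context */
	struct task *curr_selected;	/* scheduling context */
};

#define rq_selected(rq)	((rq)->curr_selected)

/* Hypothetical helper: charge a tick to the scheduling context. */
static inline void toy_tick(struct rq *rq)
{
	rq_selected(rq)->ticks++;
}

/* Hypothetical helper: would waking p preempt what was selected? */
static inline int toy_should_preempt(struct rq *rq, struct task *p)
{
	return p->prio < rq_selected(rq)->prio;
}

/* Walk through one tick and one wakeup with the contexts split. */
static inline void toy_demo(void)
{
	struct task runner = { .prio = 120, .ticks = 0 };
	struct task sel = { .prio = 10, .ticks = 0 };
	struct task woken = { .prio = 50, .ticks = 0 };
	struct rq rq = { .curr = &runner, .curr_selected = &sel };

	toy_tick(&rq);
	assert(sel.ticks == 1);		/* selected task is charged */
	assert(runner.ticks == 0);	/* running task is not */
	assert(!toy_should_preempt(&rq, &woken));
}
```

In toy_demo() the woken task (prio 50) would beat the task that
is actually running (prio 120), but the comparison is made
against the selected task (prio 10), so no preemption is
requested, and the tick is charged to the selected task.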

NOTE: Peter previously mentioned he didn't like the name
"rq_selected()", but I've not come up with a better alternative.
I'm very open to other name proposals.

Question for Peter: Dietmar suggested you'd prefer I drop the
conditionalization of the scheduler context pointer on the rq
(so rq_selected() would be open coded as rq->curr_selected or
whatever we agree on for a name), but I'd think in the
!CONFIG_SCHED_PROXY_EXEC case we'd want to avoid the wasted
pointer and its use (since curr_selected would always be
=3D=3D curr)?
If I'm wrong I'm fine switching this, but would appreciate
clarification.
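
To make the trade-off concrete, here is a small userspace sketch
(not kernel code) of the conditional form. TOY_PROXY_EXEC is a
made-up stand-in for the Kconfig option, and struct task,
struct rq and toy_check() are simplified, hypothetical
stand-ins; it assumes a struct holding only pointers has no
padding:

```c
#include <assert.h>

struct task { int id; };

/*
 * TOY_PROXY_EXEC is a made-up stand-in for the Kconfig option;
 * build with -DTOY_PROXY_EXEC to get the separate pointer.
 */
struct rq {
	struct task *curr;		/* execution context */
#ifdef TOY_PROXY_EXEC
	struct task *curr_selected;	/* scheduling context */
#endif
};

#ifdef TOY_PROXY_EXEC
#define rq_selected(rq)	((rq)->curr_selected)
static inline void rq_set_selected(struct rq *rq, struct task *t)
{
	rq->curr_selected = t;
}
#else
#define rq_selected(rq)	((rq)->curr)
static inline void rq_set_selected(struct rq *rq, struct task *t)
{
	(void)rq;	/* nothing to store: selected aliases curr */
	(void)t;
}
#endif

/* Hypothetical check of the layout and aliasing in each build. */
static inline void toy_check(void)
{
	struct task a = { 1 }, b = { 2 };
	struct rq rq = { .curr = &a };

	rq_set_selected(&rq, &b);
#ifdef TOY_PROXY_EXEC
	assert(sizeof(struct rq) == 2 * sizeof(struct task *));
	assert(rq_selected(&rq) == &b && rq.curr == &a);
#else
	assert(sizeof(struct rq) == sizeof(struct task *));
	assert(rq_selected(&rq) == rq.curr);
#endif
}
```

Built without the define, the toy rq carries a single pointer
and rq_set_selected() is a no-op; the open-coded alternative
would always carry, and have to keep in sync, a second pointer.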

Cc: Joel Fernandes <joelaf@google.com>
Cc: Qais Yousef <qyousef@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Youssef Esmat <youssefesmat@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@android.com
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20181009092434.26221-5-juri.lelli@redhat.com
[add additional comments and update more sched_class code to use
 rq::proxy]
Signed-off-by: Connor O'Brien <connoro@google.com>
[jstultz: Rebased and resolved minor collisions, reworked to use
 accessors, tweaked update_curr_common to use rq_proxy fixing rt
 scheduling issues]
Signed-off-by: John Stultz <jstultz@google.com>
---
v2:
* Reworked to use accessors
* Fixed update_curr_common to use proxy instead of curr
v3:
* Tweaked wrapper names
* Swapped proxy for selected for clarity
v4:
* Minor variable name tweaks for readability
* Use a macro instead of an inline function and drop
  other helper functions as suggested by Peter.
* Remove verbose comments/questions to avoid review
  distractions, as suggested by Dietmar
v5:
* Add CONFIG_PROXY_EXEC option to this patch so the
  new logic can be tested with this change
* Minor fix to grab rq_selected when holding the rq lock
v7:
* Minor spelling fix and unused argument fixes suggested by
  Metin Kaya
* Switch to curr_selected for consistency, and minor rewording
  of commit message for clarity
* Rename variables to selected instead of curr when we're
  using rq_selected()
* Reduce macros in CONFIG_SCHED_PROXY_EXEC ifdef sections,
  as suggested by Metin Kaya
---
 kernel/sched/core.c     | 46 ++++++++++++++++++++++++++---------------
 kernel/sched/deadline.c | 35 ++++++++++++++++---------------
 kernel/sched/fair.c     | 18 ++++++++--------
 kernel/sched/rt.c       | 40 +++++++++++++++++------------------
 kernel/sched/sched.h    | 35 +++++++++++++++++++++++++++++--
 5 files changed, 109 insertions(+), 65 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e06558fb08aa..0ce34f5c0e0c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -822,7 +822,7 @@ static enum hrtimer_restart hrtick(struct hrtimer *time=
r)
=20
 	rq_lock(rq, &rf);
 	update_rq_clock(rq);
-	rq->curr->sched_class->task_tick(rq, rq->curr, 1);
+	rq_selected(rq)->sched_class->task_tick(rq, rq_selected(rq), 1);
 	rq_unlock(rq, &rf);
=20
 	return HRTIMER_NORESTART;
@@ -2242,16 +2242,18 @@ static inline void check_class_changed(struct rq *r=
q, struct task_struct *p,
=20
 void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (p->sched_class =3D=3D rq->curr->sched_class)
-		rq->curr->sched_class->wakeup_preempt(rq, p, flags);
-	else if (sched_class_above(p->sched_class, rq->curr->sched_class))
+	struct task_struct *selected =3D rq_selected(rq);
+
+	if (p->sched_class =3D=3D selected->sched_class)
+		selected->sched_class->wakeup_preempt(rq, p, flags);
+	else if (sched_class_above(p->sched_class, selected->sched_class))
 		resched_curr(rq);
=20
 	/*
 	 * A queue event has occurred, and we're going to schedule.  In
 	 * this case, we can save a useless back to back clock update.
 	 */
-	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
+	if (task_on_rq_queued(selected) && test_tsk_need_resched(rq->curr))
 		rq_clock_skip_update(rq);
 }
=20
@@ -2780,7 +2782,7 @@ __do_set_cpus_allowed(struct task_struct *p, struct a=
ffinity_context *ctx)
 		lockdep_assert_held(&p->pi_lock);
=20
 	queued =3D task_on_rq_queued(p);
-	running =3D task_current(rq, p);
+	running =3D task_current_selected(rq, p);
=20
 	if (queued) {
 		/*
@@ -5600,7 +5602,7 @@ unsigned long long task_sched_runtime(struct task_str=
uct *p)
 	 * project cycles that may never be accounted to this
 	 * thread, breaking clock_gettime().
 	 */
-	if (task_current(rq, p) && task_on_rq_queued(p)) {
+	if (task_current_selected(rq, p) && task_on_rq_queued(p)) {
 		prefetch_curr_exec_start(p);
 		update_rq_clock(rq);
 		p->sched_class->update_curr(rq);
@@ -5668,7 +5670,8 @@ void scheduler_tick(void)
 {
 	int cpu =3D smp_processor_id();
 	struct rq *rq =3D cpu_rq(cpu);
-	struct task_struct *curr =3D rq->curr;
+	/* accounting goes to the selected task */
+	struct task_struct *selected;
 	struct rq_flags rf;
 	unsigned long thermal_pressure;
 	u64 resched_latency;
@@ -5679,16 +5682,17 @@ void scheduler_tick(void)
 	sched_clock_tick();
=20
 	rq_lock(rq, &rf);
+	selected =3D rq_selected(rq);
=20
 	update_rq_clock(rq);
 	thermal_pressure =3D arch_scale_thermal_pressure(cpu_of(rq));
 	update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure);
-	curr->sched_class->task_tick(rq, curr, 0);
+	selected->sched_class->task_tick(rq, selected, 0);
 	if (sched_feat(LATENCY_WARN))
 		resched_latency =3D cpu_resched_latency(rq);
 	calc_global_load_tick(rq);
 	sched_core_tick(rq);
-	task_tick_mm_cid(rq, curr);
+	task_tick_mm_cid(rq, selected);
=20
 	rq_unlock(rq, &rf);
=20
@@ -5697,8 +5701,8 @@ void scheduler_tick(void)
=20
 	perf_event_task_tick();
=20
-	if (curr->flags & PF_WQ_WORKER)
-		wq_worker_tick(curr);
+	if (selected->flags & PF_WQ_WORKER)
+		wq_worker_tick(selected);
=20
 #ifdef CONFIG_SMP
 	rq->idle_balance =3D idle_cpu(cpu);
@@ -5763,6 +5767,12 @@ static void sched_tick_remote(struct work_struct *wo=
rk)
 		struct task_struct *curr =3D rq->curr;
=20
 		if (cpu_online(cpu)) {
+			/*
+			 * Since this is a remote tick for full dynticks mode,
+			 * we are always sure that there is no proxy (only a
+			 * single task is running).
+			 */
+			SCHED_WARN_ON(rq->curr !=3D rq_selected(rq));
 			update_rq_clock(rq);
=20
 			if (!is_idle_task(curr)) {
@@ -6685,6 +6695,7 @@ static void __sched notrace __schedule(unsigned int s=
ched_mode)
 	}
=20
 	next =3D pick_next_task(rq, prev, &rf);
+	rq_set_selected(rq, next);
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
 #ifdef CONFIG_SCHED_DEBUG
@@ -7185,7 +7196,7 @@ void rt_mutex_setprio(struct task_struct *p, struct t=
ask_struct *pi_task)
=20
 	prev_class =3D p->sched_class;
 	queued =3D task_on_rq_queued(p);
-	running =3D task_current(rq, p);
+	running =3D task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flag);
 	if (running)
@@ -7275,7 +7286,7 @@ void set_user_nice(struct task_struct *p, long nice)
 	}
=20
 	queued =3D task_on_rq_queued(p);
-	running =3D task_current(rq, p);
+	running =3D task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
 	if (running)
@@ -7868,7 +7879,7 @@ static int __sched_setscheduler(struct task_struct *p,
 	}
=20
 	queued =3D task_on_rq_queued(p);
-	running =3D task_current(rq, p);
+	running =3D task_current_selected(rq, p);
 	if (queued)
 		dequeue_task(rq, p, queue_flags);
 	if (running)
@@ -9295,6 +9306,7 @@ void __init init_idle(struct task_struct *idle, int c=
pu)
 	rcu_read_unlock();
=20
 	rq->idle =3D idle;
+	rq_set_selected(rq, idle);
 	rcu_assign_pointer(rq->curr, idle);
 	idle->on_rq =3D TASK_ON_RQ_QUEUED;
 #ifdef CONFIG_SMP
@@ -9384,7 +9396,7 @@ void sched_setnuma(struct task_struct *p, int nid)
=20
 	rq =3D task_rq_lock(p, &rf);
 	queued =3D task_on_rq_queued(p);
-	running =3D task_current(rq, p);
+	running =3D task_current_selected(rq, p);
=20
 	if (queued)
 		dequeue_task(rq, p, DEQUEUE_SAVE);
@@ -10489,7 +10501,7 @@ void sched_move_task(struct task_struct *tsk)
=20
 	update_rq_clock(rq);
=20
-	running =3D task_current(rq, tsk);
+	running =3D task_current_selected(rq, tsk);
 	queued =3D task_on_rq_queued(tsk);
=20
 	if (queued)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 6140f1f51da1..9cf20f4ac5f9 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1150,7 +1150,7 @@ static enum hrtimer_restart dl_task_timer(struct hrti=
mer *timer)
 #endif
=20
 	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
-	if (dl_task(rq->curr))
+	if (dl_task(rq_selected(rq)))
 		wakeup_preempt_dl(rq, p, 0);
 	else
 		resched_curr(rq);
@@ -1273,7 +1273,7 @@ static u64 grub_reclaim(u64 delta, struct rq *rq, str=
uct sched_dl_entity *dl_se)
  */
 static void update_curr_dl(struct rq *rq)
 {
-	struct task_struct *curr =3D rq->curr;
+	struct task_struct *curr =3D rq_selected(rq);
 	struct sched_dl_entity *dl_se =3D &curr->dl;
 	s64 delta_exec, scaled_delta_exec;
 	int cpu =3D cpu_of(rq);
@@ -1784,7 +1784,7 @@ static int find_later_rq(struct task_struct *task);
 static int
 select_task_rq_dl(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	bool select_rq;
 	struct rq *rq;
=20
@@ -1795,6 +1795,7 @@ select_task_rq_dl(struct task_struct *p, int cpu, int=
 flags)
=20
 	rcu_read_lock();
 	curr =3D READ_ONCE(rq->curr); /* unlocked access */
+	selected =3D READ_ONCE(rq_selected(rq));
=20
 	/*
 	 * If we are dealing with a -deadline task, we must
@@ -1805,9 +1806,9 @@ select_task_rq_dl(struct task_struct *p, int cpu, int=
 flags)
 	 * other hand, if it has a shorter deadline, we
 	 * try to make it stay here, it might be important.
 	 */
-	select_rq =3D unlikely(dl_task(curr)) &&
+	select_rq =3D unlikely(dl_task(selected)) &&
 		    (curr->nr_cpus_allowed < 2 ||
-		     !dl_entity_preempt(&p->dl, &curr->dl)) &&
+		     !dl_entity_preempt(&p->dl, &selected->dl)) &&
 		    p->nr_cpus_allowed > 1;
=20
 	/*
@@ -1870,7 +1871,7 @@ static void check_preempt_equal_dl(struct rq *rq, str=
uct task_struct *p)
 	 * let's hope p can move out.
 	 */
 	if (rq->curr->nr_cpus_allowed =3D=3D 1 ||
-	    !cpudl_find(&rq->rd->cpudl, rq->curr, NULL))
+	    !cpudl_find(&rq->rd->cpudl, rq_selected(rq), NULL))
 		return;
=20
 	/*
@@ -1909,7 +1910,7 @@ static int balance_dl(struct rq *rq, struct task_stru=
ct *p, struct rq_flags *rf)
 static void wakeup_preempt_dl(struct rq *rq, struct task_struct *p,
 				  int flags)
 {
-	if (dl_entity_preempt(&p->dl, &rq->curr->dl)) {
+	if (dl_entity_preempt(&p->dl, &rq_selected(rq)->dl)) {
 		resched_curr(rq);
 		return;
 	}
@@ -1919,7 +1920,7 @@ static void wakeup_preempt_dl(struct rq *rq, struct t=
ask_struct *p,
 	 * In the unlikely case current and p have the same deadline
 	 * let us try to decide what's the best thing to do...
 	 */
-	if ((p->dl.deadline =3D=3D rq->curr->dl.deadline) &&
+	if ((p->dl.deadline =3D=3D rq_selected(rq)->dl.deadline) &&
 	    !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_dl(rq, p);
 #endif /* CONFIG_SMP */
@@ -1954,7 +1955,7 @@ static void set_next_task_dl(struct rq *rq, struct ta=
sk_struct *p, bool first)
 	if (hrtick_enabled_dl(rq))
 		start_hrtick_dl(rq, p);
=20
-	if (rq->curr->sched_class !=3D &dl_sched_class)
+	if (rq_selected(rq)->sched_class !=3D &dl_sched_class)
 		update_dl_rq_load_avg(rq_clock_pelt(rq), rq, 0);
=20
 	deadline_queue_push_tasks(rq);
@@ -2268,8 +2269,8 @@ static int push_dl_task(struct rq *rq)
 	 * can move away, it makes sense to just reschedule
 	 * without going further in pushing next_task.
 	 */
-	if (dl_task(rq->curr) &&
-	    dl_time_before(next_task->dl.deadline, rq->curr->dl.deadline) &&
+	if (dl_task(rq_selected(rq)) &&
+	    dl_time_before(next_task->dl.deadline, rq_selected(rq)->dl.deadline) =
&&
 	    rq->curr->nr_cpus_allowed > 1) {
 		resched_curr(rq);
 		return 0;
@@ -2394,7 +2395,7 @@ static void pull_dl_task(struct rq *this_rq)
 			 * deadline than the current task of its runqueue.
 			 */
 			if (dl_time_before(p->dl.deadline,
-					   src_rq->curr->dl.deadline))
+					   rq_selected(src_rq)->dl.deadline))
 				goto skip;
=20
 			if (is_migration_disabled(p)) {
@@ -2435,9 +2436,9 @@ static void task_woken_dl(struct rq *rq, struct task_=
struct *p)
 	if (!task_on_cpu(rq, p) &&
 	    !test_tsk_need_resched(rq->curr) &&
 	    p->nr_cpus_allowed > 1 &&
-	    dl_task(rq->curr) &&
+	    dl_task(rq_selected(rq)) &&
 	    (rq->curr->nr_cpus_allowed < 2 ||
-	     !dl_entity_preempt(&p->dl, &rq->curr->dl))) {
+	     !dl_entity_preempt(&p->dl, &rq_selected(rq)->dl))) {
 		push_dl_tasks(rq);
 	}
 }
@@ -2612,12 +2613,12 @@ static void switched_to_dl(struct rq *rq, struct ta=
sk_struct *p)
 		return;
 	}
=20
-	if (rq->curr !=3D p) {
+	if (rq_selected(rq) !=3D p) {
 #ifdef CONFIG_SMP
 		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
 			deadline_queue_push_tasks(rq);
 #endif
-		if (dl_task(rq->curr))
+		if (dl_task(rq_selected(rq)))
 			wakeup_preempt_dl(rq, p, 0);
 		else
 			resched_curr(rq);
@@ -2646,7 +2647,7 @@ static void prio_changed_dl(struct rq *rq, struct tas=
k_struct *p,
 	if (!rq->dl.overloaded)
 		deadline_queue_pull_task(rq);
=20
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		/*
 		 * If we now have a earlier deadline task than p,
 		 * then reschedule, provided p is still on this
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1251fd01a555..07216ea3ed53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1157,7 +1157,7 @@ static s64 update_curr_se(struct rq *rq, struct sched=
_entity *curr)
  */
 s64 update_curr_common(struct rq *rq)
 {
-	struct task_struct *curr =3D rq->curr;
+	struct task_struct *curr =3D rq_selected(rq);
 	s64 delta_exec;
=20
 	delta_exec =3D update_curr_se(rq, &curr->se);
@@ -1203,7 +1203,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
=20
 static void update_curr_fair(struct rq *rq)
 {
-	update_curr(cfs_rq_of(&rq->curr->se));
+	update_curr(cfs_rq_of(&rq_selected(rq)->se));
 }
=20
 static inline void
@@ -6611,7 +6611,7 @@ static void hrtick_start_fair(struct rq *rq, struct t=
ask_struct *p)
 		s64 delta =3D slice - ran;
=20
 		if (delta < 0) {
-			if (task_current(rq, p))
+			if (task_current_selected(rq, p))
 				resched_curr(rq);
 			return;
 		}
@@ -6626,7 +6626,7 @@ static void hrtick_start_fair(struct rq *rq, struct t=
ask_struct *p)
  */
 static void hrtick_update(struct rq *rq)
 {
-	struct task_struct *curr =3D rq->curr;
+	struct task_struct *curr =3D rq_selected(rq);
=20
 	if (!hrtick_enabled_fair(rq) || curr->sched_class !=3D &fair_sched_class)
 		return;
@@ -8235,7 +8235,7 @@ static void set_next_buddy(struct sched_entity *se)
  */
 static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p=
, int wake_flags)
 {
-	struct task_struct *curr =3D rq->curr;
+	struct task_struct *curr =3D rq_selected(rq);
 	struct sched_entity *se =3D &curr->se, *pse =3D &p->se;
 	struct cfs_rq *cfs_rq =3D task_cfs_rq(curr);
 	int next_buddy_marked =3D 0;
@@ -8268,7 +8268,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, =
struct task_struct *p, int
 	 * prevents us from potentially nominating it as a false LAST_BUDDY
 	 * below.
 	 */
-	if (test_tsk_need_resched(curr))
+	if (test_tsk_need_resched(rq->curr))
 		return;
=20
 	/* Idle tasks are by definition preempted by non-idle tasks. */
@@ -9252,7 +9252,7 @@ static bool __update_blocked_others(struct rq *rq, bo=
ol *done)
 	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
 	 * DL and IRQ signals have been updated before updating CFS.
 	 */
-	curr_class =3D rq->curr->sched_class;
+	curr_class =3D rq_selected(rq)->sched_class;
=20
 	thermal_pressure =3D arch_scale_thermal_pressure(cpu_of(rq));
=20
@@ -12640,7 +12640,7 @@ prio_changed_fair(struct rq *rq, struct task_struct=
 *p, int oldprio)
 	 * our priority decreased, or if we are not currently running on
 	 * this runqueue and our priority is higher than the current's
 	 */
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 		if (p->prio > oldprio)
 			resched_curr(rq);
 	} else
@@ -12743,7 +12743,7 @@ static void switched_to_fair(struct rq *rq, struct =
task_struct *p)
 		 * kick off the schedule if running, otherwise just see
 		 * if we can still preempt the current task.
 		 */
-		if (task_current(rq, p))
+		if (task_current_selected(rq, p))
 			resched_curr(rq);
 		else
 			wakeup_preempt(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9cdea3ea47da..2682cec45aaa 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -530,7 +530,7 @@ static void dequeue_rt_entity(struct sched_rt_entity *r=
t_se, unsigned int flags)
=20
 static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
 {
-	struct task_struct *curr =3D rq_of_rt_rq(rt_rq)->curr;
+	struct task_struct *curr =3D rq_selected(rq_of_rt_rq(rt_rq));
 	struct rq *rq =3D rq_of_rt_rq(rt_rq);
 	struct sched_rt_entity *rt_se;
=20
@@ -1000,7 +1000,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt=
_rq)
  */
 static void update_curr_rt(struct rq *rq)
 {
-	struct task_struct *curr =3D rq->curr;
+	struct task_struct *curr =3D rq_selected(rq);
 	struct sched_rt_entity *rt_se =3D &curr->rt;
 	s64 delta_exec;
=20
@@ -1545,7 +1545,7 @@ static int find_lowest_rq(struct task_struct *task);
 static int
 select_task_rq_rt(struct task_struct *p, int cpu, int flags)
 {
-	struct task_struct *curr;
+	struct task_struct *curr, *selected;
 	struct rq *rq;
 	bool test;
=20
@@ -1557,6 +1557,7 @@ select_task_rq_rt(struct task_struct *p, int cpu, int=
 flags)
=20
 	rcu_read_lock();
 	curr =3D READ_ONCE(rq->curr); /* unlocked access */
+	selected =3D READ_ONCE(rq_selected(rq));
=20
 	/*
 	 * If the current task on @p's runqueue is an RT task, then
@@ -1585,8 +1586,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, int=
 flags)
 	 * systems like big.LITTLE.
 	 */
 	test =3D curr &&
-	       unlikely(rt_task(curr)) &&
-	       (curr->nr_cpus_allowed < 2 || curr->prio <=3D p->prio);
+	       unlikely(rt_task(selected)) &&
+	       (curr->nr_cpus_allowed < 2 || selected->prio <=3D p->prio);
=20
 	if (test || !rt_task_fits_capacity(p, cpu)) {
 		int target =3D find_lowest_rq(p);
@@ -1616,12 +1617,8 @@ select_task_rq_rt(struct task_struct *p, int cpu, in=
t flags)
=20
 static void check_preempt_equal_prio(struct rq *rq, struct task_struct *p)
 {
-	/*
-	 * Current can't be migrated, useless to reschedule,
-	 * let's hope p can move out.
-	 */
 	if (rq->curr->nr_cpus_allowed =3D=3D 1 ||
-	    !cpupri_find(&rq->rd->cpupri, rq->curr, NULL))
+	    !cpupri_find(&rq->rd->cpupri, rq_selected(rq), NULL))
 		return;
=20
 	/*
@@ -1664,7 +1661,9 @@ static int balance_rt(struct rq *rq, struct task_stru=
ct *p, struct rq_flags *rf)
  */
 static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int fl=
ags)
 {
-	if (p->prio < rq->curr->prio) {
+	struct task_struct *curr =3D rq_selected(rq);
+
+	if (p->prio < curr->prio) {
 		resched_curr(rq);
 		return;
 	}
@@ -1682,7 +1681,7 @@ static void wakeup_preempt_rt(struct rq *rq, struct t=
ask_struct *p, int flags)
 	 * to move current somewhere else, making room for our non-migratable
 	 * task.
 	 */
-	if (p->prio =3D=3D rq->curr->prio && !test_tsk_need_resched(rq->curr))
+	if (p->prio =3D=3D curr->prio && !test_tsk_need_resched(rq->curr))
 		check_preempt_equal_prio(rq, p);
 #endif
 }
@@ -1707,7 +1706,7 @@ static inline void set_next_task_rt(struct rq *rq, st=
ruct task_struct *p, bool f
 	 * utilization. We only care of the case where we start to schedule a
 	 * rt task
 	 */
-	if (rq->curr->sched_class !=3D &rt_sched_class)
+	if (rq_selected(rq)->sched_class !=3D &rt_sched_class)
 		update_rt_rq_load_avg(rq_clock_pelt(rq), rq, 0);
=20
 	rt_queue_push_tasks(rq);
@@ -1988,6 +1987,7 @@ static struct task_struct *pick_next_pushable_task(st=
ruct rq *rq)
=20
 	BUG_ON(rq->cpu !=3D task_cpu(p));
 	BUG_ON(task_current(rq, p));
+	BUG_ON(task_current_selected(rq, p));
 	BUG_ON(p->nr_cpus_allowed <=3D 1);
=20
 	BUG_ON(!task_on_rq_queued(p));
@@ -2020,7 +2020,7 @@ static int push_rt_task(struct rq *rq, bool pull)
 	 * higher priority than current. If that's the case
 	 * just reschedule current.
 	 */
-	if (unlikely(next_task->prio < rq->curr->prio)) {
+	if (unlikely(next_task->prio < rq_selected(rq)->prio)) {
 		resched_curr(rq);
 		return 0;
 	}
@@ -2375,7 +2375,7 @@ static void pull_rt_task(struct rq *this_rq)
 			 * p if it is lower in priority than the
 			 * current task on the run queue
 			 */
-			if (p->prio < src_rq->curr->prio)
+			if (p->prio < rq_selected(src_rq)->prio)
 				goto skip;
=20
 			if (is_migration_disabled(p)) {
@@ -2419,9 +2419,9 @@ static void task_woken_rt(struct rq *rq, struct task_=
struct *p)
 	bool need_to_push =3D !task_on_cpu(rq, p) &&
 			    !test_tsk_need_resched(rq->curr) &&
 			    p->nr_cpus_allowed > 1 &&
-			    (dl_task(rq->curr) || rt_task(rq->curr)) &&
+			    (dl_task(rq_selected(rq)) || rt_task(rq_selected(rq))) &&
 			    (rq->curr->nr_cpus_allowed < 2 ||
-			     rq->curr->prio <=3D p->prio);
+			     rq_selected(rq)->prio <=3D p->prio);
=20
 	if (need_to_push)
 		push_rt_tasks(rq);
@@ -2505,7 +2505,7 @@ static void switched_to_rt(struct rq *rq, struct task=
_struct *p)
 		if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
 			rt_queue_push_tasks(rq);
 #endif /* CONFIG_SMP */
-		if (p->prio < rq->curr->prio && cpu_online(cpu_of(rq)))
+		if (p->prio < rq_selected(rq)->prio && cpu_online(cpu_of(rq)))
 			resched_curr(rq);
 	}
 }
@@ -2520,7 +2520,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p,=
 int oldprio)
 	if (!task_on_rq_queued(p))
 		return;
=20
-	if (task_current(rq, p)) {
+	if (task_current_selected(rq, p)) {
 #ifdef CONFIG_SMP
 		/*
 		 * If our priority decreases while running, we
@@ -2546,7 +2546,7 @@ prio_changed_rt(struct rq *rq, struct task_struct *p,=
 int oldprio)
 		 * greater than the current running task
 		 * then reschedule.
 		 */
-		if (p->prio < rq->curr->prio)
+		if (p->prio < rq_selected(rq)->prio)
 			resched_curr(rq);
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e0e4fc8734b..6ea1dfbe502a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -994,7 +994,10 @@ struct rq {
 	 */
 	unsigned int		nr_uninterruptible;
=20
-	struct task_struct __rcu	*curr;
+	struct task_struct __rcu	*curr;       /* Execution context */
+#ifdef CONFIG_SCHED_PROXY_EXEC
+	struct task_struct __rcu	*curr_selected; /* Scheduling context (policy) */
+#endif
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
@@ -1189,6 +1192,20 @@ DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 #define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
 #define raw_rq()		raw_cpu_ptr(&runqueues)
=20
+#ifdef CONFIG_SCHED_PROXY_EXEC
+#define rq_selected(rq)		((rq)->curr_selected)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+	rcu_assign_pointer(rq->curr_selected, t);
+}
+#else
+#define rq_selected(rq)		((rq)->curr)
+static inline void rq_set_selected(struct rq *rq, struct task_struct *t)
+{
+	/* Do nothing */
+}
+#endif
+
 struct sched_group;
 #ifdef CONFIG_SCHED_CORE
 static inline struct cpumask *sched_group_span(struct sched_group *sg);
@@ -2112,11 +2129,25 @@ static inline u64 global_rt_runtime(void)
 	return (u64)sysctl_sched_rt_runtime * NSEC_PER_USEC;
 }
=20
+/*
+ * Is p the current execution context?
+ */
 static inline int task_current(struct rq *rq, struct task_struct *p)
 {
 	return rq->curr =3D=3D p;
 }
=20
+/*
+ * Is p the current scheduling context?
+ *
+ * Note that it might be the current execution context at the same time if
+ * rq->curr =3D=3D rq_selected() =3D=3D p.
+ */
+static inline int task_current_selected(struct rq *rq, struct task_struct =
*p)
+{
+	return rq_selected(rq) =3D=3D p;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2280,7 +2311,7 @@ struct sched_class {
=20
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
-	WARN_ON_ONCE(rq->curr !=3D prev);
+	WARN_ON_ONCE(rq_selected(rq) !=3D prev);
 	prev->sched_class->put_prev_task(rq, prev);
 }
=20
--=20
2.43.0.472.g3155946c3a-goog