[v18] Single RunQueue Proxy Execution (v18)

[RESEND][PATCH v18 6/8] sched: Add an initial sketch of the find_proxy_task() function

Posted by John Stultz 3 months ago

Add a find_proxy_task() function which doesn't do much.

When we select a blocked task to run, we will just deactivate it
and pick again. The exception being if it has become unblocked
after find_proxy_task() was called.

This allows us to validate keeping blocked tasks on the runqueue
and later deactivating them is working ok, stressing the failure
cases for when a proxy isn't found.

Greatly simplified from patch by:
  Peter Zijlstra (Intel) <peterz@infradead.org>
  Juri Lelli <juri.lelli@redhat.com>
  Valentin Schneider <valentin.schneider@arm.com>
  Connor O'Brien <connoro@google.com>

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
[jstultz: Split out from larger proxy patch and simplified
 for review and testing.]
Signed-off-by: John Stultz <jstultz@google.com>
---
v5:
* Split out from larger proxy patch
v7:
* Fixed unused function arguments, spelling nits, and tweaks for
  clarity, pointed out by Metin Kaya
* Fix build warning Reported-by: kernel test robot <lkp@intel.com>
  Closes: https://lore.kernel.org/oe-kbuild-all/202311081028.yDLmCWgr-lkp@intel.com/
v8:
* Fixed case where we might return a blocked task from find_proxy_task()
* Continued tweaks to handle avoiding returning blocked tasks
v9:
* Add zap_balance_callbacks helper to unwind balance_callbacks
  when we will re-call pick_next_task() again.
* Add extra comment suggested by Metin
* Typo fixes from Metin
* Moved adding proxy_resched_idle earlier in the series, as suggested
  by Metin
* Fix to call proxy_resched_idle() *prior* to deactivating next, to avoid
  crashes caused by stale references to next
* s/PROXY/SCHED_PROXY_EXEC/ as suggested by Metin
* Number of tweaks and cleanups suggested by Metin
* Simplify proxy_deactivate as suggested by Metin
v11:
* Tweaks for earlier simplification in try_to_deactivate_task
v13:
* Rename rename "next" to "donor" in find_proxy_task() for clarity
* Similarly use "donor" instead of next in proxy_deactivate
* Refactor/simplify proxy_resched_idle
* Moved up a needed fix from later in the series
v15:
* Tweaked some comments to better explain the initial sketch of
  find_proxy_task(), suggested by Qais
* Build fixes for !CONFIG_SMP
* Slight rework for blocked_on_state being added later in the
  series.
* Move the zap_balance_callbacks to later in the patch series
v16:
* Move the enqueue_task_rt() out to later in the series, as
  suggested by K Prateek Nayak
* Fixup whitespace error pointed out by K Prateek Nayak
* Use put_prev_set_next_task as suggested by K Prateek Nayak
* Try to rework find_proxy_task() locking to use guard and
  proxy_deactivate_task() in the way Peter suggested.
v17:
* Slightly simplified variable names per suggestion from
  Juri Lelli
* Minor comment and commit message tweaks suggested by Peter

Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: Qais Yousef <qyousef@layalina.io>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Zimuzo Ezeozue <zezeozue@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Will Deacon <will@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Metin Kaya <Metin.Kaya@arm.com>
Cc: Xuewen Yan <xuewen.yan94@gmail.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: kuyo chang <kuyo.chang@mediatek.com>
Cc: hupu <hupu.gm@gmail.com>
Cc: kernel-team@android.com
---
 kernel/sched/core.c  | 100 +++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  10 ++++-
 2 files changed, 106 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 853157b27f384..dc82d9b8bee2c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6614,7 +6614,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
  * Otherwise marks the task's __state as RUNNING
  */
 static bool try_to_block_task(struct rq *rq, struct task_struct *p,
-			      unsigned long *task_state_p)
+			      unsigned long *task_state_p,
+			      bool deactivate_cond)
 {
 	unsigned long task_state = *task_state_p;
 	int flags = DEQUEUE_NOCLOCK;
@@ -6625,6 +6626,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 		return false;
 	}
 
+	if (!deactivate_cond)
+		return false;
+
 	p->sched_contributes_to_load =
 		(task_state & TASK_UNINTERRUPTIBLE) &&
 		!(task_state & TASK_NOLOAD) &&
@@ -6648,6 +6652,89 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
 	return true;
 }
 
+#ifdef CONFIG_SCHED_PROXY_EXEC
+static inline struct task_struct *proxy_resched_idle(struct rq *rq)
+{
+	put_prev_set_next_task(rq, rq->donor, rq->idle);
+	rq_set_donor(rq, rq->idle);
+	set_tsk_need_resched(rq->idle);
+	return rq->idle;
+}
+
+static bool __proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+	unsigned long state = READ_ONCE(donor->__state);
+
+	/* Don't deactivate if the state has been changed to TASK_RUNNING */
+	if (state == TASK_RUNNING)
+		return false;
+	/*
+	 * Because we got donor from pick_next_task(), it is *crucial*
+	 * that we call proxy_resched_idle() before we deactivate it.
+	 * As once we deactivate donor, donor->on_rq is set to zero,
+	 * which allows ttwu() to immediately try to wake the task on
+	 * another rq. So we cannot use *any* references to donor
+	 * after that point. So things like cfs_rq->curr or rq->donor
+	 * need to be changed from next *before* we deactivate.
+	 */
+	proxy_resched_idle(rq);
+	return try_to_block_task(rq, donor, &state, true);
+}
+
+static struct task_struct *proxy_deactivate(struct rq *rq, struct task_struct *donor)
+{
+	if (!__proxy_deactivate(rq, donor)) {
+		/*
+		 * XXX: For now, if deactivation failed, set donor
+		 * as unblocked, as we aren't doing proxy-migrations
+		 * yet (more logic will be needed then).
+		 */
+		donor->blocked_on = NULL;
+	}
+	return NULL;
+}
+
+/*
+ * Initial simple sketch that just deactivates the blocked task
+ * chosen by pick_next_task() so we can then pick something that
+ * isn't blocked.
+ */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	struct mutex *mutex;
+
+	mutex = donor->blocked_on;
+	/* Something changed in the chain, so pick again */
+	if (!mutex)
+		return NULL;
+	/*
+	 * By taking mutex->wait_lock we hold off concurrent mutex_unlock()
+	 * and ensure @owner sticks around.
+	 */
+	guard(raw_spinlock)(&mutex->wait_lock);
+
+	/* Check again that donor is blocked with blocked_lock held */
+	if (!task_is_blocked(donor) || mutex != __get_task_blocked_on(donor)) {
+		/*
+		 * Something changed in the blocked_on chain and
+		 * we don't know if only at this level. So, let's
+		 * just bail out completely and let __schedule()
+		 * figure things out (pick_again loop).
+		 */
+		return NULL; /* do pick_next_task() again */
+	}
+	return proxy_deactivate(rq, donor);
+}
+#else /* SCHED_PROXY_EXEC */
+static struct task_struct *
+find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
+{
+	WARN_ONCE(1, "This should never be called in the !SCHED_PROXY_EXEC case\n");
+	return donor;
+}
+#endif /* SCHED_PROXY_EXEC */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6760,12 +6847,19 @@ static void __sched notrace __schedule(int sched_mode)
 			goto picked;
 		}
 	} else if (!preempt && prev_state) {
-		try_to_block_task(rq, prev, &prev_state);
+		try_to_block_task(rq, prev, &prev_state,
+				  !task_is_blocked(prev));
 		switch_count = &prev->nvcsw;
 	}
 
-	next = pick_next_task(rq, prev, &rf);
+pick_again:
+	next = pick_next_task(rq, rq->donor, &rf);
 	rq_set_donor(rq, next);
+	if (unlikely(task_is_blocked(next))) {
+		next = find_proxy_task(rq, next, &rf);
+		if (!next)
+			goto pick_again;
+	}
 picked:
 	clear_tsk_need_resched(prev);
 	clear_preempt_need_resched();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6b61e0c7f6e78..590a44c0215fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2290,6 +2290,14 @@ static inline int task_current_donor(struct rq *rq, struct task_struct *p)
 	return rq->donor == p;
 }
 
+static inline bool task_is_blocked(struct task_struct *p)
+{
+	if (!sched_proxy_exec())
+		return false;
+
+	return !!p->blocked_on;
+}
+
 static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
 {
 #ifdef CONFIG_SMP
@@ -2499,7 +2507,7 @@ static inline void put_prev_set_next_task(struct rq *rq,
 					  struct task_struct *prev,
 					  struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != prev);
+	WARN_ON_ONCE(rq->donor != prev);
 
 	__put_prev_set_next_dl_server(rq, prev, next);
 
-- 
2.50.0.727.gbf7dc18ff4-goog

Re: [RESEND][PATCH v18 6/8] sched: Add an initial sketch of the find_proxy_task() function

Posted by Peter Zijlstra 2 months, 4 weeks ago

On Mon, Jul 07, 2025 at 08:43:53PM +0000, John Stultz wrote:

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 853157b27f384..dc82d9b8bee2c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6614,7 +6614,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>   * Otherwise marks the task's __state as RUNNING
>   */
>  static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> -			      unsigned long *task_state_p)
> +			      unsigned long *task_state_p,
> +			      bool deactivate_cond)
>  {
>  	unsigned long task_state = *task_state_p;
>  	int flags = DEQUEUE_NOCLOCK;
> @@ -6625,6 +6626,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>  		return false;
>  	}
>  
> +	if (!deactivate_cond)
> +		return false;
> +
>  	p->sched_contributes_to_load =
>  		(task_state & TASK_UNINTERRUPTIBLE) &&
>  		!(task_state & TASK_NOLOAD) &&

I'm struggling with this; @deactivate_cond doesn't seem to adequately
cover what it actually does.

So far what it seems to do is when true, don't block. It still does the
signal thing -- but I can't tell if that is actually required or not.

Would 'should_block' be a better name? And maybe stick a little
something in the comment above try_to_block_task() or near the:

	if (!should_block)
		return false;

lines about why the signal bits are important to have done.

> @@ -6648,6 +6652,89 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
>  	return true;
>  }
>  
> +#ifdef CONFIG_SCHED_PROXY_EXEC
> +static inline struct task_struct *proxy_resched_idle(struct rq *rq)
> +{
> +	put_prev_set_next_task(rq, rq->donor, rq->idle);
> +	rq_set_donor(rq, rq->idle);
> +	set_tsk_need_resched(rq->idle);
> +	return rq->idle;
> +}

Nothing cares about the return value.

Re: [RESEND][PATCH v18 6/8] sched: Add an initial sketch of the find_proxy_task() function

Posted by John Stultz 2 months, 4 weeks ago

On Thu, Jul 10, 2025 at 3:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jul 07, 2025 at 08:43:53PM +0000, John Stultz wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 853157b27f384..dc82d9b8bee2c 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6614,7 +6614,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> >   * Otherwise marks the task's __state as RUNNING
> >   */
> >  static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> > -                           unsigned long *task_state_p)
> > +                           unsigned long *task_state_p,
> > +                           bool deactivate_cond)
> >  {
> >       unsigned long task_state = *task_state_p;
> >       int flags = DEQUEUE_NOCLOCK;
> > @@ -6625,6 +6626,9 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> >               return false;
> >       }
> >
> > +     if (!deactivate_cond)
> > +             return false;
> > +
> >       p->sched_contributes_to_load =
> >               (task_state & TASK_UNINTERRUPTIBLE) &&
> >               !(task_state & TASK_NOLOAD) &&
>
> I'm struggling with this; @deactivate_cond doesn't seem to adequately
> cover what it actually does.
>

So initially, there was just this extra logic to exit early if the
prev task was blocked_on so we don't deactivate it (keeping it on the
rq). However, in cases (later in the series) where we later decide to
bail on proxying and just deactivate the task even if we are
blocked_on, I wanted to re-use this for that fallback case, so passing
the !blocked_on state as an argument (which could be forced to true
when we bail) made sense to me. Thus a "deactivation condition".

> So far what it seems to do is when true, don't block. It still does the
> signal thing -- but I can't tell if that is actually required or not.

So, preserving the signal check seemed important, as if we do get a
signal we probably want to make sure we don't deactivate and that the
task so it can be selected to run.
I'm curious as to your thinking that makes you skeptical that it is required?

That said, I appreciate you raising the question, as looking at and
thinking about it some more, I do see that we should be clearing the
blocked_on state when we do set ourselves as runnable in that case,
otherwise if we are blocked_on we might get stuck waiting for a wakeup
(likely from the lock release).

I'll fix that up here.

> Would 'should_block' be a better name? And maybe stick a little
> something in the comment above try_to_block_task() or near the:
>
>         if (!should_block)
>                 return false;
>
> lines about why the signal bits are important to have done.

Sure, I'm happy to use that if should_block is more clear. I'll try to
add some better comments as well.

>
> > @@ -6648,6 +6652,89 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> >       return true;
> >  }
> >
> > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > +static inline struct task_struct *proxy_resched_idle(struct rq *rq)
> > +{
> > +     put_prev_set_next_task(rq, rq->donor, rq->idle);
> > +     rq_set_donor(rq, rq->idle);
> > +     set_tsk_need_resched(rq->idle);
> > +     return rq->idle;
> > +}
>
> Nothing cares about the return value.

Well, nothing in this patch, but the last patch in this series I sent
here uses it.  But if you think it makes the series easier to grok and
I can drop it here and add the return in that later patch.

Thanks so much again for the review and feedback! I really appreciate it!
-john

Re: [RESEND][PATCH v18 6/8] sched: Add an initial sketch of the find_proxy_task() function

Posted by John Stultz 2 months, 3 weeks ago

On Thu, Jul 10, 2025 at 5:43 PM John Stultz <jstultz@google.com> wrote:
> On Thu, Jul 10, 2025 at 3:02 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > So far what it seems to do is when true, don't block. It still does the
> > signal thing -- but I can't tell if that is actually required or not.
...
> That said, I appreciate you raising the question, as looking at and
> thinking about it some more, I do see that we should be clearing the
> blocked_on state when we do set ourselves as runnable in that case,
> otherwise if we are blocked_on we might get stuck waiting for a wakeup
> (likely from the lock release).
>
> I'll fix that up here.

So just FYI, I actually talked myself out of that position last night
before finishing up and sending out v19. If we are blocked_on, we're
not going to get back to userland to handle the signal, we're just
going to wake up, spin through the mutex lock code and head back into
__schedule(), so clearing the blocked_on state isn't the right answer
there. Leaving the blocked_on task on the runqueue (even if we set it
TASK_RUNNABLE) until the blocked_on state is cleared should be fine.

I did some stress testing, and never actually saw a case where we had
signals pending and were blocked on, so it seems like a rare corner
case, and I feel leaving the existing longer-tested behavior is right.

thanks
-john

Re: [RESEND][PATCH v18 6/8] sched: Add an initial sketch of the find_proxy_task() function

Posted by Peter Zijlstra 2 months, 4 weeks ago

On Thu, Jul 10, 2025 at 05:43:36PM -0700, John Stultz wrote:

> > > @@ -6648,6 +6652,89 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> > >       return true;
> > >  }
> > >
> > > +#ifdef CONFIG_SCHED_PROXY_EXEC
> > > +static inline struct task_struct *proxy_resched_idle(struct rq *rq)
> > > +{
> > > +     put_prev_set_next_task(rq, rq->donor, rq->idle);
> > > +     rq_set_donor(rq, rq->idle);
> > > +     set_tsk_need_resched(rq->idle);
> > > +     return rq->idle;
> > > +}
> >
> > Nothing cares about the return value.
> 
> Well, nothing in this patch, but the last patch in this series I sent
> here uses it.  But if you think it makes the series easier to grok and
> I can drop it here and add the return in that later patch.

I found it in the later patch; I meant to reply here, but promptly
forgot. It can stay here I suppose, no harm done.

[RESEND][PATCH v18 1/8] sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
[RESEND][PATCH v18 2/8] locking/mutex: Rework task_struct::blocked_on
[RESEND][PATCH v18 3/8] locking/mutex: Add p->blocked_on wrappers for correctness checks
[RESEND][PATCH v18 4/8] sched: Move update_curr_task logic into update_curr_se
[RESEND][PATCH v18 5/8] sched: Fix runtime accounting w/ split exec & sched contexts
[RESEND][PATCH v18 6/8] sched: Add an initial sketch of the find_proxy_task() function
[RESEND][PATCH v18 7/8] sched: Fix proxy/current (push,pull)ability
[RESEND][PATCH v18 8/8] sched: Start blocked_on chain processing in find_proxy_task()